Re: spark cluster setup

2015-08-02 Thread Sonal Goyal
Your master log files will be in the Spark home folder/logs on the master
machine. Do they show an error?
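
On a plain standalone install these are usually under $SPARK_HOME/logs with a
name like the one below (a guess at your layout; <user> and <hostname> are
placeholders, and packaged installs such as CDH typically put them under
/var/log/spark instead):

$SPARK_HOME/logs/spark-<user>-org.apache.spark.deploy.master.Master-1-<hostname>.out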

Best Regards,
Sonal
Founder, Nube Technologies 
Check out Reifier at Spark Summit 2015






On Mon, Aug 3, 2015 at 9:27 AM, Angel Angel  wrote:

> Hi,
>
> I have attached a snapshot of the console.
> Actually, I don't know how to see the master logs.
> Still, I have attached my master web UI.
>
> And these are the log file errors.
>
>
>
>
> 2015-07-23 17:00:59,977 ERROR
> org.apache.spark.scheduler.ReplayListenerBus: Malformed line: 
>
>
> 2015-07-23 17:01:00,096 INFO org.eclipse.jetty.server.Server:
> jetty-8.y.z-SNAPSHOT
>
> 2015-07-23 17:01:00,138 INFO org.eclipse.jetty.server.AbstractConnector:
> Started SelectChannelConnector@0.0.0.0:18088
>
> 2015-07-23 17:01:00,138 INFO org.apache.spark.util.Utils: Successfully
> started service on port 18088.
>
> 2015-07-23 17:01:00,140 INFO
> org.apache.spark.deploy.history.HistoryServer: Started HistoryServer at
> http://hadoopm0:18088
>
> 2015-07-24 11:36:18,148 INFO org.apache.spark.SecurityManager: Changing
> view acls to: spark
>
> 2015-07-24 11:36:18,148 INFO org.apache.spark.SecurityManager: Changing
> modify acls to: spark
>
> 2015-07-24 11:36:18,148 INFO org.apache.spark.SecurityManager:
> SecurityManager: authentication disabled; ui acls disabled; users with view
> permissions: Set(spark); users with modify permissions: Set(spark)
>
> 2015-07-24 11:36:18,367 INFO org.apache.spark.SecurityManager: Changing
> acls enabled to: false
>
> 2015-07-24 11:36:18,367 INFO org.apache.spark.SecurityManager: Changing
> admin acls to:
>
> 2015-07-24 11:36:18,368 INFO org.apache.spark.SecurityManager: Changing
> view acls to: root
>
>
> Thanks.
>
>
> On Mon, Aug 3, 2015 at 11:52 AM, Sonal Goyal 
> wrote:
>
>> What do the master logs show?
>>
>> Best Regards,
>> Sonal
>> Founder, Nube Technologies
>> 
>>
>> Check out Reifier at Spark Summit 2015
>> 
>>
>>
>> 
>>
>>
>>
>> On Mon, Aug 3, 2015 at 7:46 AM, Angel Angel 
>> wrote:
>>
>>> Hello Sir,
>>>
>>> I have installed Spark.
>>>
>>>
>>>
>>> The local  spark-shell is working fine.
>>>
>>>
>>>
>>> But whenever I tried the Master configuration I got some errors.
>>>
>>>
>>>
>>> When I run this command:
>>>
>>> MASTER=spark://hadoopm0:7077 spark-shell
>>>
>>>
>>>
>>> I get errors like:
>>>
>>>
>>>
>>> 15/07/27 21:17:26 INFO AppClient$ClientActor: Connecting to master
>>> spark://hadoopm0:7077...
>>>
>>> 15/07/27 21:17:46 ERROR SparkDeploySchedulerBackend: Application has
>>> been killed. Reason: All masters are unresponsive! Giving up.
>>>
>>> 15/07/27 21:17:46 WARN SparkDeploySchedulerBackend: Application ID is
>>> not initialized yet.
>>>
>>> 15/07/27 21:17:46 ERROR TaskSchedulerImpl: Exiting due to error from
>>> cluster scheduler: All masters are unresponsive! Giving up.
>>>
>>>
>>>
>>> Also I have attached my screenshot of the Master UI.
>>>
>>>
>>> Also I have tested using the telnet command:
>>>
>>>
>>> It shows that hadoopm0 is connected.
>>>
>>>
>>>
>>> Can you please give me some references or documentation on how to solve
>>> this issue?
>>>
>>> Thanks in advance.
>>>
>>> Thanking You,
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>
>>
>


Extremely poor predictive performance with RF in mllib

2015-08-02 Thread pkphlam
Hi,

This might be a long shot, but has anybody run into very poor predictive
performance using RandomForest with MLlib? Here is what I'm doing:

- Spark 1.4.1 with PySpark
- Python 3.4.2
- ~30,000 Tweets of text
- 12289 1s and 15956 0s
- Whitespace tokenization and then hashing trick for feature selection using
10,000 features
- Run RF with 100 trees and maxDepth of 4 and then predict using the
features from all the 1s observations.

So in theory, I should get predictions of close to 12289 1s (especially if
the model overfits). But I'm getting exactly 0 1s, which sounds ludicrous to
me and makes me suspect something is wrong with my code or I'm missing
something. I notice similar behavior (although not as extreme) if I play
around with the settings. But I'm getting normal behavior with other
classifiers, so I don't think it's my setup that's the problem.

For example:

>>> lrm = LogisticRegressionWithSGD.train(lp, iterations=10)
>>> logit_predict = lrm.predict(predict_feat)
>>> logit_predict.sum()
9077

>>> nb = NaiveBayes.train(lp)
>>> nb_predict = nb.predict(predict_feat)
>>> nb_predict.sum()
10287.0

>>> rf = RandomForest.trainClassifier(lp, numClasses=2,
>>> categoricalFeaturesInfo={}, numTrees=100, seed=422)
>>> rf_predict = rf.predict(predict_feat)
>>> rf_predict.sum()
0.0

This code was all run back to back so I didn't change anything in between.
Does anybody have a possible explanation for this?

Thanks!
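
For reference, here is a minimal Scala/MLlib cross-check (an illustration only;
it assumes a spark-shell session with an RDD[LabeledPoint] named lp built from
the same hashed features as the PySpark `lp` above). Checking the label balance
that actually reaches the trainer and the splits the forest chooses is usually
enough to tell whether the all-zero predictions come from the featurization or
from the model itself:

import org.apache.spark.mllib.tree.RandomForest

// assumption: lp is an RDD[org.apache.spark.mllib.regression.LabeledPoint]
val model = RandomForest.trainClassifier(lp, numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](), numTrees = 100,
  featureSubsetStrategy = "auto", impurity = "gini", maxDepth = 4,
  maxBins = 32, seed = 422)

println(lp.map(_.label).countByValue())   // did the 12289/15956 label split survive featurization?
println(model.totalNumNodes)              // 100 depth-4 trees should yield well over 100 nodes
println(model.toDebugString.take(2000))   // which features and thresholds did the trees actually pick?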



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Extremely-poor-predictive-performance-with-RF-in-mllib-tp24112.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Checkpoint file not found

2015-08-02 Thread Anand Nalya
Hi,

I'm writing a streaming application in Spark 1.3. After running for some
time, I'm getting the following exception. I'm sure that no other process is
modifying the HDFS file. Any idea what might be the cause of this?

15/08/02 21:24:13 ERROR scheduler.DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting down SparkContext
java.io.FileNotFoundException: File does not exist: hdfs://node16:8020/user/anandnalya/tiered-original/e6794c2c-1c9f-414a-ae7e-e58a8f874661/rdd-5112/part-0
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1132)
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1124)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1124)
at org.apache.spark.rdd.CheckpointRDD.getPreferredLocations(CheckpointRDD.scala:66)
at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$1.apply(RDD.scala:230)
at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$1.apply(RDD.scala:230)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:230)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1324)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1334)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1333)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1333)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1333)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1331)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1331)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1334)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1333)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1333)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1333)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1331)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1331)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1334)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1333)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1333)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1333)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1331)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1331)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1334)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1333)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferre

Re: Cannot Import Package (spark-csv)

2015-08-02 Thread Ted Yu
I tried the following command on master branch:
bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3 --jars
../spark-csv_2.10-1.0.3.jar --master local

I didn't reproduce the error with your command.

FYI
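
One thing worth double-checking (an assumption on my side, it isn't visible in
the thread): the pre-built Spark 1.4.1 downloads are compiled against Scala
2.10, and a Scala 2.11 spark-csv artifact can fail to load from a 2.10 shell
with exactly this "Failed to load class for data source" error. If the shell is
in fact running on Scala 2.10, matching the artifact to it should work, e.g.:

bin/spark-shell --packages com.databricks:spark-csv_2.10:1.1.0 --master local

scala> val df = sqlContext.read.format("com.databricks.spark.csv")
     |   .option("header", "true").load("cars.csv")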

On Sun, Aug 2, 2015 at 8:57 PM, Bill Chambers <
wchamb...@ischool.berkeley.edu> wrote:

> Sure the commands are:
>
> scala> val df =
> sqlContext.read.format("com.databricks.spark.csv").option("header",
> "true").load("cars.csv")
>
> and get the following error:
>
> java.lang.RuntimeException: Failed to load class for data source:
> com.databricks.spark.csv
>   at scala.sys.package$.error(package.scala:27)
>   at
> org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:220)
>   at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:233)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
>   ... 49 elided
>
> On Sun, Aug 2, 2015 at 8:56 PM, Ted Yu  wrote:
>
>> The command you ran and the error you got were not visible.
>>
>> Mind sending them again ?
>>
>> Cheers
>>
>> On Sun, Aug 2, 2015 at 8:33 PM, billchambers <
>> wchamb...@ischool.berkeley.edu> wrote:
>>
>>> I am trying to import the spark csv package while using the scala spark
>>> shell. Spark 1.4.1, Scala 2.11
>>>
>>> I am starting the shell with:
>>>
>>> bin/spark-shell --packages com.databricks:spark-csv_2.11:1.1.0 --jars
>>> ../sjars/spark-csv_2.11-1.1.0.jar --master local
>>>
>>>
>>> I then try and run
>>>
>>>
>>>
>>> and get the following error:
>>>
>>>
>>>
>>> What am I doing wrong?
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-Import-Package-spark-csv-tp24109.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>
>
> --
> Bill Chambers
> http://billchambers.me/
> Email  | LinkedIn
>  | Twitter
>  | Github 
>


Re: Cannot Import Package (spark-csv)

2015-08-02 Thread billchambers
Sure, the commands again are:

scala> val df =
sqlContext.read.format("com.databricks.spark.csv").option("header",
"true").load("cars.csv")

and get the following error: 

java.lang.RuntimeException: Failed to load class for data source:
com.databricks.spark.csv
  at scala.sys.package$.error(package.scala:27)
  at
org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:220)
  at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:233)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
  ... 49 elided



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-Import-Package-spark-csv-tp24109p24110.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Cannot Import Package (spark-csv)

2015-08-02 Thread Ted Yu
The command you ran and the error you got were not visible.

Mind sending them again ?

Cheers

On Sun, Aug 2, 2015 at 8:33 PM, billchambers  wrote:

> I am trying to import the spark csv package while using the scala spark
> shell. Spark 1.4.1, Scala 2.11
>
> I am starting the shell with:
>
> bin/spark-shell --packages com.databricks:spark-csv_2.11:1.1.0 --jars
> ../sjars/spark-csv_2.11-1.1.0.jar --master local
>
>
> I then try and run
>
>
>
> and get the following error:
>
>
>
> What am I doing wrong?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-Import-Package-spark-csv-tp24109.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Cannot Import Package (spark-csv)

2015-08-02 Thread billchambers
I am trying to import the spark csv package while using the scala spark
shell. Spark 1.4.1, Scala 2.11

I am starting the shell with:

bin/spark-shell --packages com.databricks:spark-csv_2.11:1.1.0 --jars
../sjars/spark-csv_2.11-1.1.0.jar --master local


I then try and run



and get the following error:



What am I doing wrong?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-Import-Package-spark-csv-tp24109.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark cluster setup

2015-08-02 Thread Sonal Goyal
What do the master logs show?

Best Regards,
Sonal
Founder, Nube Technologies


Check out Reifier at Spark Summit 2015






On Mon, Aug 3, 2015 at 7:46 AM, Angel Angel  wrote:

> Hello Sir,
>
> I have installed Spark.
>
>
>
> The local  spark-shell is working fine.
>
>
>
> But whenever I tried the Master configuration I got some errors.
>
>
>
> When I run this command:
>
> MASTER=spark://hadoopm0:7077 spark-shell
>
>
>
> I get errors like:
>
>
>
> 15/07/27 21:17:26 INFO AppClient$ClientActor: Connecting to master
> spark://hadoopm0:7077...
>
> 15/07/27 21:17:46 ERROR SparkDeploySchedulerBackend: Application has been
> killed. Reason: All masters are unresponsive! Giving up.
>
> 15/07/27 21:17:46 WARN SparkDeploySchedulerBackend: Application ID is not
> initialized yet.
>
> 15/07/27 21:17:46 ERROR TaskSchedulerImpl: Exiting due to error from
> cluster scheduler: All masters are unresponsive! Giving up.
>
>
>
> Also I have attached my screenshot of the Master UI.
>
>
> Also I have tested using the telnet command:
>
>
> It shows that hadoopm0 is connected.
>
>
>
> Can you please give me some references or documentation on how to solve
> this issue?
>
> Thanks in advance.
>
> Thanking You,
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>


Re: About memory leak in spark 1.4.1

2015-08-02 Thread Sea
"spark uses a lot more than heap memory, it is the expected behavior."  It 
didn't exist in spark 1.3.x
What does "a lot more than" means?  It means that I lose control of it!
I try to  apply 31g, but it still grows to 55g and continues to grow!!! That is 
the point!
I have tried set memoryFraction to 0.2??but it didn't help.
I don't know whether it will still exist in the next release 1.5, I wish not.






-- Original Message --
From: "Barak Gitsis";
Sent: Aug 2, 2015 (Sunday), 9:55
To: "Sea"<261810...@qq.com>; "Ted Yu";
Cc: "user@spark.apache.org"; "rxin"; "joshrosen"; "davies";
Subject: Re: About memory leak in spark 1.4.1



spark uses a lot more than heap memory, it is the expected behavior.
In 1.4 off-heap memory usage is supposed to grow in comparison to 1.3.


Better use as little memory as you can for heap, and since you are not 
utilizing it already, it is safe for you to reduce it.
memoryFraction helps you optimize heap usage for your data/application profile 
while keeping it tight.



 






On Sun, Aug 2, 2015 at 12:54 PM Sea <261810...@qq.com> wrote:

spark.storage.memoryFraction is in heap memory, but my situation is that the 
memory is more than heap memory !  


Anyone else use spark 1.4.1 in production? 




-- Original Message --
From: "Ted Yu";
Sent: Aug 2, 2015 (Sunday), 5:45 PM
To: "Sea"<261810...@qq.com>;
Cc: "Barak Gitsis"; "user@spark.apache.org"; "rxin"; "joshrosen"; "davies";
Subject: Re: About memory leak in spark 1.4.1




http://spark.apache.org/docs/latest/tuning.html does mention 
spark.storage.memoryFraction in two places.
One is under Cache Size Tuning section.


FYI


On Sun, Aug 2, 2015 at 2:16 AM, Sea <261810...@qq.com> wrote:
Hi, Barak
It is ok with spark 1.3.0, the problem is with spark 1.4.1.
I don't think spark.storage.memoryFraction will make any sense, because it 
is still in heap memory. 




-- Original Message --
From: "Barak Gitsis";
Sent: Aug 2, 2015 (Sunday), 4:11 PM
To: "Sea"<261810...@qq.com>; "user";
Cc: "rxin"; "joshrosen"; "davies";
Subject: Re: About memory leak in spark 1.4.1



Hi,
reducing spark.storage.memoryFraction did the trick for me. Heap doesn't get
filled because it is reserved.
My reasoning is: 
I give executor all the memory i can give it, so that makes it a boundary.
From here i try to make the best use of memory I can. storage.memoryFraction is 
in a sense user data space.  The rest can be used by the system. 
If you don't have so much data that you MUST store in memory for performance, 
better give spark more space.. 
ended up setting it to 0.3


All that said, it is on spark 1.3 on cluster


hope that helps


On Sat, Aug 1, 2015 at 5:43 PM Sea <261810...@qq.com> wrote:

Hi, all
I upgraded spark to 1.4.1, and many applications failed... I find the heap memory is
not full, but the CoarseGrainedExecutorBackend process takes more memory than I
expect, and it increases as time goes on, finally exceeding the max limit of the
server, and then the worker dies.


Can anyone help?


Mode: standalone


spark.executor.memory 50g


25583 xiaoju20   0 75.5g  55g  28m S 1729.3 88.1   2172:52 java


55g is more than the 50g I requested



-- 

-Barak








-- 

-Barak

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Abhishek R. Singh
I don't know if (your assertion/expectation that) workers will process things 
(multiple partitions) in parallel is really valid. Or if having more partitions 
than workers will necessarily help (unless you are memory bound - so partitions 
are essentially helping your work size rather than execution parallelism).

[Disclaimer: I am no authority on Spark, but wanted to throw in my spin based on my
own understanding.]

Nothing official about it :)

-abhishek-

> On Jul 31, 2015, at 1:03 PM, Sujit Pal  wrote:
> 
> Hello,
> 
> I am trying to run a Spark job that hits an external webservice to get back 
> some information. The cluster is 1 master + 4 workers, each worker has 60GB 
> RAM and 4 CPUs. The external webservice is a standalone Solr server, and is 
> accessed using code similar to that shown below.
> 
>> def getResults(keyValues: Iterator[(String, Array[String])]):
>> Iterator[(String, String)] = {
>> val solr = new HttpSolrClient()
>> initializeSolrParameters(solr)
>> keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
>> }
>> myRDD.repartition(10) 
>>  .mapPartitions(keyValues => getResults(keyValues))
>  
> The mapPartitions does some initialization to the SolrJ client per partition 
> and then hits it for each record in the partition via the getResults() call.
> 
> I repartitioned in the hope that this will result in 10 clients hitting Solr 
> simultaneously (I would like to go upto maybe 30-40 simultaneous clients if I 
> can). However, I counted the number of open connections using "netstat -anp | 
> grep ":8983.*ESTABLISHED" in a loop on the Solr box and observed that Solr 
> has a constant 4 clients (ie, equal to the number of workers) over the 
> lifetime of the run.
> 
> My observation leads me to believe that each worker processes a single stream 
> of work sequentially. However, from what I understand about how Spark works, 
> each worker should be able to process number of tasks parallelly, and that 
> repartition() is a hint for it to do so.
> 
> Is there some SparkConf environment variable I should set to increase 
> parallelism in these workers, or should I just configure a cluster with 
> multiple workers per machine? Or is there something I am doing wrong?
> 
> Thank you in advance for any pointers you can provide.
> 
> -sujit
> 


Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Steve Loughran

On 2 Aug 2015, at 13:42, Sujit Pal <sujitatgt...@gmail.com> wrote:

There is no additional configuration on the external Solr host from my code, I 
am using the default HttpClient provided by HttpSolrServer. According to the 
Javadocs, you can pass in a HttpClient object as well. Is there some specific 
configuration you would suggest to get past any limits?


Usually there's some thread pooling going on client side, covered in docs like
http://hc.apache.org/httpcomponents-client-ga/tutorial/html/connmgmt.html

I don't know if that applies, how to tune it, etc. I do know that if you go the 
other way and allow unlimited connections you raise "different" support 
problems.

-steve


Re: TCP/IP speedup

2015-08-02 Thread Steve Loughran

On 1 Aug 2015, at 18:26, Ruslan Dautkhanov <dautkha...@gmail.com> wrote:

If your network is bandwidth-bound, you'll see setting jumbo frames (MTU 9000)
may increase bandwidth up to ~20%.

http://docs.hortonworks.com/HDP2Alpha/index.htm#Hardware_Recommendations_for_Hadoop.htm
"Enabling Jumbo Frames across the cluster improves bandwidth"

+1

you can also get better checksums of packets, so that the (very small but 
non-zero) risk of corrupted network packets drops a bit more.


If Spark workload is not network bandwidth-bound, I can see it'll be a few 
percent to no improvement.



Put differently: it shouldn't hurt. The shuffle phase is the most network
heavy, especially as it can span the entire cluster, so backbone bandwidth
("bisection bandwidth") can become the bottleneck and mean that jobs can
interfere.

scheduling of work close to the HDFS data means that HDFS reads should often be 
local (the TCP stack gets bypassed entirely), or at least rack-local (sharing 
the switch, not any backbone)


but there's other things there, as the slide talks about


-stragglers: often a sign of pending HDD failure, as reads are retried. The
classic Hadoop MR engine detects these, can spin up alternate mappers (if you
enable speculation), and will blacklist the node for further work. Sometimes 
though that straggling is just unbalanced data -some bits of work may be 
computationally a lot harder, slowing things down.

-contention for work on the nodes. In YARN you request how many "virtual cores" 
you want (ops get to define the map of virtual to physical), with each node 
having a finite set of cores

but ...
  -Unless CPU throttling is turned on, competing processes can take up more CPU 
than they asked for.
  -that virtual:physical core setting may be of

There's also disk IOP contention; two jobs trying to get at the same spindle, 
even though there are lots of disks on the server. There's not much you can do 
about that (today).

A key takeaway from that talk, which applies to all work-tuning talks, is: get
data from your real workloads. There's some good htrace instrumentation in HDFS
these days; I haven't looked at Spark's instrumentation to see how they hook up.
You can also expect to have some network monitoring (sflow, ...) which you 
could use to see if the backbone is overloaded. Don't forget the Linux tooling 
either, iotop &c. There's lots of room to play here -once you've got the data 
you can see where to focus, then decide how much time to spend trying to tune 
it.

-steve


--
Ruslan Dautkhanov

On Sat, Aug 1, 2015 at 6:08 PM, Simon Edelhaus <edel...@gmail.com> wrote:
H

2% huh.


-- ttfn
Simon Edelhaus
California 2015

On Sat, Aug 1, 2015 at 3:45 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
https://spark-summit.org/2015/events/making-sense-of-spark-performance/

On Sat, Aug 1, 2015 at 3:24 PM, Simon Edelhaus <edel...@gmail.com> wrote:
Hi All!

How important would a significant performance improvement to TCP/IP itself be,
in terms of
overall job performance improvement? Which part would be most significantly
accelerated?
Would it be HDFS?

-- ttfn
Simon Edelhaus
California 2015






Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Igor Berman
So how many cores do you configure per node?
Do you have something like total-executor-cores or maybe a --num-executors
config (I'm not sure what kind of cluster the Databricks platform provides; if
it's standalone then the first option should be used)? If you have 4 cores in
total, then even though you have 4 cores per machine, only 1 is working on each
machine... which could be a cause.
Another option - you are hitting some default config limiting the number of
concurrent routes or the max total connections from the JVM; look at
https://hc.apache.org/httpcomponents-client-ga/tutorial/html/connmgmt.html
(assuming you are using HttpClient from 4.x and not the 3.x version).
Not sure what the defaults are...
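
In case it helps, here is a rough sketch of the connection-pool side (my own
illustration, not something tested on the Databricks platform; the Solr URL is
a placeholder and it assumes SolrJ 4.x with Apache HttpClient 4.x on the
classpath). The idea is to hand HttpSolrServer an explicitly pooled client,
created inside mapPartitions just like the existing one, so more than a handful
of connections per route are allowed; giving the application more total cores
(e.g. total-executor-cores on a standalone cluster) is the other half:

import org.apache.http.impl.client.DefaultHttpClient
import org.apache.http.impl.conn.PoolingClientConnectionManager
import org.apache.solr.client.solrj.impl.HttpSolrServer

val cm = new PoolingClientConnectionManager()
cm.setMaxTotal(40)            // total connections across all routes
cm.setDefaultMaxPerRoute(40)  // connections allowed to the single Solr host

val httpClient = new DefaultHttpClient(cm)
val solr = new HttpSolrServer("http://solr-host:8983/solr/collection1", httpClient)  // placeholder URL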



On 2 August 2015 at 23:42, Sujit Pal  wrote:

> Hi Igor,
>
> The cluster is a Databricks Spark cluster. It consists of 1 master + 4
> workers, each worker has 60GB RAM and 4 CPUs. The original mail has some
> more details (also the reference to the HttpSolrClient in there should be
> HttpSolrServer, sorry about that, mistake while writing the email).
>
> There is no additional configuration on the external Solr host from my
> code, I am using the default HttpClient provided by HttpSolrServer.
> According to the Javadocs, you can pass in a HttpClient object as well. Is
> there some specific configuration you would suggest to get past any limits?
>
> On another project, I faced a similar problem but I had more leeway (was
> using a Spark cluster from EC2) and less time, my workaround was to use
> python multiprocessing to create a program that started up 30 python
> JSON/HTTP clients and wrote output into 30 output files, which were then
> processed by Spark. Reason I mention this is that I was using default
> configurations there as well, just needed to increase the number of
> connections against Solr to a higher number.
>
> This time round, I would like to do this through Spark because it makes
> the pipeline less complex.
>
> -sujit
>
>
> On Sun, Aug 2, 2015 at 10:52 AM, Igor Berman 
> wrote:
>
>> What kind of cluster? How many cores on each worker? Is there config for
>> http solr client? I remember standard httpclient has limit per route/host.
>> On Aug 2, 2015 8:17 PM, "Sujit Pal"  wrote:
>>
>>> No one has any ideas?
>>>
>>> Is there some more information I should provide?
>>>
>>> I am looking for ways to increase the parallelism among workers.
>>> Currently I just see number of simultaneous connections to Solr equal to
>>> the number of workers. My number of partitions is (2.5x) larger than number
>>> of workers, and the workers seem to be large enough to handle more than one
>>> task at a time.
>>>
>>> I am creating a single client per partition in my mapPartition call. Not
>>> sure if that is creating the gating situation? Perhaps I should use a Pool
>>> of clients instead?
>>>
>>> Would really appreciate some pointers.
>>>
>>> Thanks in advance for any help you can provide.
>>>
>>> -sujit
>>>
>>>
>>> On Fri, Jul 31, 2015 at 1:03 PM, Sujit Pal 
>>> wrote:
>>>
 Hello,

 I am trying to run a Spark job that hits an external webservice to get
 back some information. The cluster is 1 master + 4 workers, each worker has
 60GB RAM and 4 CPUs. The external webservice is a standalone Solr server,
 and is accessed using code similar to that shown below.

 def getResults(keyValues: Iterator[(String, Array[String])]):
> Iterator[(String, String)] = {
> val solr = new HttpSolrClient()
> initializeSolrParameters(solr)
> keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
> }
> myRDD.repartition(10)

  .mapPartitions(keyValues => getResults(keyValues))
>

 The mapPartitions does some initialization to the SolrJ client per
 partition and then hits it for each record in the partition via the
 getResults() call.

 I repartitioned in the hope that this will result in 10 clients hitting
 Solr simultaneously (I would like to go upto maybe 30-40 simultaneous
 clients if I can). However, I counted the number of open connections using
 "netstat -anp | grep ":8983.*ESTABLISHED" in a loop on the Solr box and
 observed that Solr has a constant 4 clients (ie, equal to the number of
 workers) over the lifetime of the run.

 My observation leads me to believe that each worker processes a single
 stream of work sequentially. However, from what I understand about how
 Spark works, each worker should be able to process number of tasks
 parallelly, and that repartition() is a hint for it to do so.

 Is there some SparkConf environment variable I should set to increase
 parallelism in these workers, or should I just configure a cluster with
 multiple workers per machine? Or is there something I am doing wrong?

 Thank you in advance for any pointers you can provide.

 -sujit


>>>
>


RE: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Silvio Fiorito
Can you share the transformations up to the foreachPartition?

From: Sujit Pal
Sent: ‎8/‎2/‎2015 4:42 PM
To: Igor Berman
Cc: user
Subject: Re: How to increase parallelism of a Spark cluster?

Hi Igor,

The cluster is a Databricks Spark cluster. It consists of 1 master + 4 workers, 
each worker has 60GB RAM and 4 CPUs. The original mail has some more details 
(also the reference to the HttpSolrClient in there should be HttpSolrServer, 
sorry about that, mistake while writing the email).

There is no additional configuration on the external Solr host from my code, I 
am using the default HttpClient provided by HttpSolrServer. According to the 
Javadocs, you can pass in a HttpClient object as well. Is there some specific 
configuration you would suggest to get past any limits?

On another project, I faced a similar problem but I had more leeway (was using 
a Spark cluster from EC2) and less time, my workaround was to use python 
multiprocessing to create a program that started up 30 python JSON/HTTP clients 
and wrote output into 30 output files, which were then processed by Spark. 
Reason I mention this is that I was using default configurations there as well, 
just needed to increase the number of connections against Solr to a higher 
number.

This time round, I would like to do this through Spark because it makes the 
pipeline less complex.

-sujit


On Sun, Aug 2, 2015 at 10:52 AM, Igor Berman <igor.ber...@gmail.com> wrote:

What kind of cluster? How many cores on each worker? Is there config for http 
solr client? I remember standard httpclient has limit per route/host.

On Aug 2, 2015 8:17 PM, "Sujit Pal" <sujitatgt...@gmail.com> wrote:
No one has any ideas?

Is there some more information I should provide?

I am looking for ways to increase the parallelism among workers. Currently I 
just see number of simultaneous connections to Solr equal to the number of 
workers. My number of partitions is (2.5x) larger than number of workers, and 
the workers seem to be large enough to handle more than one task at a time.

I am creating a single client per partition in my mapPartition call. Not sure 
if that is creating the gating situation? Perhaps I should use a Pool of 
clients instead?

Would really appreciate some pointers.

Thanks in advance for any help you can provide.

-sujit


On Fri, Jul 31, 2015 at 1:03 PM, Sujit Pal <sujitatgt...@gmail.com> wrote:
Hello,

I am trying to run a Spark job that hits an external webservice to get back 
some information. The cluster is 1 master + 4 workers, each worker has 60GB RAM 
and 4 CPUs. The external webservice is a standalone Solr server, and is 
accessed using code similar to that shown below.

def getResults(keyValues: Iterator[(String, Array[String])]):
Iterator[(String, String)] = {
val solr = new HttpSolrClient()
initializeSolrParameters(solr)
keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
}
myRDD.repartition(10)
 .mapPartitions(keyValues => getResults(keyValues))

The mapPartitions does some initialization to the SolrJ client per partition 
and then hits it for each record in the partition via the getResults() call.

I repartitioned in the hope that this will result in 10 clients hitting Solr 
simultaneously (I would like to go upto maybe 30-40 simultaneous clients if I 
can). However, I counted the number of open connections using "netstat -anp | 
grep ":8983.*ESTABLISHED" in a loop on the Solr box and observed that Solr has 
a constant 4 clients (ie, equal to the number of workers) over the lifetime of 
the run.

My observation leads me to believe that each worker processes a single stream 
of work sequentially. However, from what I understand about how Spark works, 
each worker should be able to process number of tasks parallelly, and that 
repartition() is a hint for it to do so.

Is there some SparkConf environment variable I should set to increase 
parallelism in these workers, or should I just configure a cluster with 
multiple workers per machine? Or is there something I am doing wrong?

Thank you in advance for any pointers you can provide.

-sujit





Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Sujit Pal
Hi Igor,

The cluster is a Databricks Spark cluster. It consists of 1 master + 4
workers, each worker has 60GB RAM and 4 CPUs. The original mail has some
more details (also the reference to the HttpSolrClient in there should be
HttpSolrServer, sorry about that, mistake while writing the email).

There is no additional configuration on the external Solr host from my
code, I am using the default HttpClient provided by HttpSolrServer.
According to the Javadocs, you can pass in a HttpClient object as well. Is
there some specific configuration you would suggest to get past any limits?

On another project, I faced a similar problem but I had more leeway (was
using a Spark cluster from EC2) and less time, my workaround was to use
python multiprocessing to create a program that started up 30 python
JSON/HTTP clients and wrote output into 30 output files, which were then
processed by Spark. Reason I mention this is that I was using default
configurations there as well, just needed to increase the number of
connections against Solr to a higher number.

This time round, I would like to do this through Spark because it makes the
pipeline less complex.

-sujit


On Sun, Aug 2, 2015 at 10:52 AM, Igor Berman  wrote:

> What kind of cluster? How many cores on each worker? Is there config for
> http solr client? I remember standard httpclient has limit per route/host.
> On Aug 2, 2015 8:17 PM, "Sujit Pal"  wrote:
>
>> No one has any ideas?
>>
>> Is there some more information I should provide?
>>
>> I am looking for ways to increase the parallelism among workers.
>> Currently I just see number of simultaneous connections to Solr equal to
>> the number of workers. My number of partitions is (2.5x) larger than number
>> of workers, and the workers seem to be large enough to handle more than one
>> task at a time.
>>
>> I am creating a single client per partition in my mapPartition call. Not
>> sure if that is creating the gating situation? Perhaps I should use a Pool
>> of clients instead?
>>
>> Would really appreciate some pointers.
>>
>> Thanks in advance for any help you can provide.
>>
>> -sujit
>>
>>
>> On Fri, Jul 31, 2015 at 1:03 PM, Sujit Pal 
>> wrote:
>>
>>> Hello,
>>>
>>> I am trying to run a Spark job that hits an external webservice to get
>>> back some information. The cluster is 1 master + 4 workers, each worker has
>>> 60GB RAM and 4 CPUs. The external webservice is a standalone Solr server,
>>> and is accessed using code similar to that shown below.
>>>
>>> def getResults(keyValues: Iterator[(String, Array[String])]):
 Iterator[(String, String)] = {
 val solr = new HttpSolrClient()
 initializeSolrParameters(solr)
 keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
 }
 myRDD.repartition(10)
>>>
>>>  .mapPartitions(keyValues => getResults(keyValues))

>>>
>>> The mapPartitions does some initialization to the SolrJ client per
>>> partition and then hits it for each record in the partition via the
>>> getResults() call.
>>>
>>> I repartitioned in the hope that this will result in 10 clients hitting
>>> Solr simultaneously (I would like to go upto maybe 30-40 simultaneous
>>> clients if I can). However, I counted the number of open connections using
>>> "netstat -anp | grep ":8983.*ESTABLISHED" in a loop on the Solr box and
>>> observed that Solr has a constant 4 clients (ie, equal to the number of
>>> workers) over the lifetime of the run.
>>>
>>> My observation leads me to believe that each worker processes a single
>>> stream of work sequentially. However, from what I understand about how
>>> Spark works, each worker should be able to process number of tasks
>>> parallelly, and that repartition() is a hint for it to do so.
>>>
>>> Is there some SparkConf environment variable I should set to increase
>>> parallelism in these workers, or should I just configure a cluster with
>>> multiple workers per machine? Or is there something I am doing wrong?
>>>
>>> Thank you in advance for any pointers you can provide.
>>>
>>> -sujit
>>>
>>>
>>


how to ignore MatchError then processing a large json file in spark-sql

2015-08-02 Thread fuellee lee
I'm trying to process a bunch of large JSON log files with Spark, but it
fails every time with `scala.MatchError`, whether I give it a schema or not.

I just want to skip lines that do not match the schema, but I can't find how
in the Spark docs.

I know that writing a JSON parser and mapping it over the JSON file RDD can get
things done, but I want to use
`sqlContext.read.schema(schema).json(fileNames).selectExpr(...)` because
it's much easier to maintain.
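
One possible middle ground, if the MatchError really is coming from malformed
lines (a sketch under that assumption; `fileNames` and `schema` are the values
you already have): pre-filter the raw lines with a cheap JSON parse so that
only well-formed objects reach the schema-driven reader, and keep the rest of
the pipeline unchanged.

import scala.util.Try
import com.fasterxml.jackson.databind.ObjectMapper

val raw = sc.textFile(fileNames.mkString(","))   // assumes fileNames: Seq[String]
val wellFormed = raw.mapPartitions { lines =>
  val mapper = new ObjectMapper()                // one parser per partition
  lines.filter(line => Try(mapper.readTree(line)).isSuccess)
}
val df = sqlContext.read.schema(schema).json(wellFormed)
// then the same .selectExpr(...) as before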

thanks


Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Igor Berman
What kind of cluster? How many cores on each worker? Is there config for
http solr client? I remember standard httpclient has limit per route/host.
On Aug 2, 2015 8:17 PM, "Sujit Pal"  wrote:

> No one has any ideas?
>
> Is there some more information I should provide?
>
> I am looking for ways to increase the parallelism among workers. Currently
> I just see number of simultaneous connections to Solr equal to the number
> of workers. My number of partitions is (2.5x) larger than number of
> workers, and the workers seem to be large enough to handle more than one
> task at a time.
>
> I am creating a single client per partition in my mapPartition call. Not
> sure if that is creating the gating situation? Perhaps I should use a Pool
> of clients instead?
>
> Would really appreciate some pointers.
>
> Thanks in advance for any help you can provide.
>
> -sujit
>
>
> On Fri, Jul 31, 2015 at 1:03 PM, Sujit Pal  wrote:
>
>> Hello,
>>
>> I am trying to run a Spark job that hits an external webservice to get
>> back some information. The cluster is 1 master + 4 workers, each worker has
>> 60GB RAM and 4 CPUs. The external webservice is a standalone Solr server,
>> and is accessed using code similar to that shown below.
>>
>> def getResults(keyValues: Iterator[(String, Array[String])]):
>>> Iterator[(String, String)] = {
>>> val solr = new HttpSolrClient()
>>> initializeSolrParameters(solr)
>>> keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
>>> }
>>> myRDD.repartition(10)
>>
>>  .mapPartitions(keyValues => getResults(keyValues))
>>>
>>
>> The mapPartitions does some initialization to the SolrJ client per
>> partition and then hits it for each record in the partition via the
>> getResults() call.
>>
>> I repartitioned in the hope that this will result in 10 clients hitting
>> Solr simultaneously (I would like to go upto maybe 30-40 simultaneous
>> clients if I can). However, I counted the number of open connections using
>> "netstat -anp | grep ":8983.*ESTABLISHED" in a loop on the Solr box and
>> observed that Solr has a constant 4 clients (ie, equal to the number of
>> workers) over the lifetime of the run.
>>
>> My observation leads me to believe that each worker processes a single
>> stream of work sequentially. However, from what I understand about how
>> Spark works, each worker should be able to process number of tasks
>> parallelly, and that repartition() is a hint for it to do so.
>>
>> Is there some SparkConf environment variable I should set to increase
>> parallelism in these workers, or should I just configure a cluster with
>> multiple workers per machine? Or is there something I am doing wrong?
>>
>> Thank you in advance for any pointers you can provide.
>>
>> -sujit
>>
>>
>


Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Sujit Pal
No one has any ideas?

Is there some more information I should provide?

I am looking for ways to increase the parallelism among workers. Currently
I just see number of simultaneous connections to Solr equal to the number
of workers. My number of partitions is (2.5x) larger than number of
workers, and the workers seem to be large enough to handle more than one
task at a time.

I am creating a single client per partition in my mapPartition call. Not
sure if that is creating the gating situation? Perhaps I should use a Pool
of clients instead?

Would really appreciate some pointers.

Thanks in advance for any help you can provide.

-sujit


On Fri, Jul 31, 2015 at 1:03 PM, Sujit Pal  wrote:

> Hello,
>
> I am trying to run a Spark job that hits an external webservice to get
> back some information. The cluster is 1 master + 4 workers, each worker has
> 60GB RAM and 4 CPUs. The external webservice is a standalone Solr server,
> and is accessed using code similar to that shown below.
>
> def getResults(keyValues: Iterator[(String, Array[String])]):
>> Iterator[(String, String)] = {
>> val solr = new HttpSolrClient()
>> initializeSolrParameters(solr)
>> keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
>> }
>> myRDD.repartition(10)
>
>  .mapPartitions(keyValues => getResults(keyValues))
>>
>
> The mapPartitions does some initialization to the SolrJ client per
> partition and then hits it for each record in the partition via the
> getResults() call.
>
> I repartitioned in the hope that this will result in 10 clients hitting
> Solr simultaneously (I would like to go upto maybe 30-40 simultaneous
> clients if I can). However, I counted the number of open connections using
> "netstat -anp | grep ":8983.*ESTABLISHED" in a loop on the Solr box and
> observed that Solr has a constant 4 clients (ie, equal to the number of
> workers) over the lifetime of the run.
>
> My observation leads me to believe that each worker processes a single
> stream of work sequentially. However, from what I understand about how
> Spark works, each worker should be able to process number of tasks
> parallelly, and that repartition() is a hint for it to do so.
>
> Is there some SparkConf environment variable I should set to increase
> parallelism in these workers, or should I just configure a cluster with
> multiple workers per machine? Or is there something I am doing wrong?
>
> Thank you in advance for any pointers you can provide.
>
> -sujit
>
>


Re: TCP/IP speedup

2015-08-02 Thread Michael Segel
This may seem like a silly question… but in following Mark’s link, the 
presentation talks about the TPC-DS benchmark. 

Here’s my question… what benchmark results? 

If you go over to the TPC.org  website they have no TPC-DS 
benchmarks listed. 
(Either audited or unaudited) 

So what gives? 

Note: There are TPCx-HS benchmarks listed… 

Thx

-Mike

> On Aug 1, 2015, at 5:45 PM, Mark Hamstra  wrote:
> 
> https://spark-summit.org/2015/events/making-sense-of-spark-performance/ 
> 
> 
> On Sat, Aug 1, 2015 at 3:24 PM, Simon Edelhaus  > wrote:
> Hi All!
> 
> How important would a significant performance improvement to TCP/IP itself
> be, in terms of
> overall job performance improvement? Which part would be most significantly
> accelerated? 
> Would it be HDFS?
> 
> -- ttfn
> Simon Edelhaus
> California 2015
> 




Re: spark no output

2015-08-02 Thread Connor Zanin
I agree with Ted. Could you please post the log file?
On Aug 2, 2015 10:13 AM, "Ted Yu"  wrote:

> Can you provide some more detai:
>
> release of Spark you're using
> were you running in standalone or YARN cluster mode
> have you checked driver log ?
>
> Cheers
>
> On Sun, Aug 2, 2015 at 7:04 AM, Pa Rö 
> wrote:
>
>> hi community,
>>
>> I have run my k-means Spark application on 1 million data points. The
>> program works, but no output is generated in HDFS. When it runs on
>> 10,000 points, output is written.
>>
>> maybe someone has an idea?
>>
>> best regards,
>> paul
>>
>
>


Re: spark no output

2015-08-02 Thread Ted Yu
Can you provide some more detai:

release of Spark you're using
were you running in standalone or YARN cluster mode
have you checked driver log ?

Cheers

On Sun, Aug 2, 2015 at 7:04 AM, Pa Rö 
wrote:

> hi community,
>
> I have run my k-means Spark application on 1 million data points. The
> program works, but no output is generated in HDFS. When it runs on
> 10,000 points, output is written.
>
> maybe someone has an idea?
>
> best regards,
> paul
>


spark no output

2015-08-02 Thread Pa Rö
hi community,

I have run my k-means Spark application on 1 million data points. The
program works, but no output is generated in HDFS. When it runs on
10,000 points, output is written.

maybe someone has an idea?

best regards,
paul


Re: About memory leak in spark 1.4.1

2015-08-02 Thread Barak Gitsis
spark uses a lot more than heap memory, it is the expected behavior.
in 1.4 off-heap memory usage is supposed to grow in comparison to 1.3

Better use as little memory as you can for heap, and since you are not
utilizing it already, it is safe for you to reduce it.
memoryFraction helps you optimize heap usage for your data/application
profile while keeping it tight.
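
For the record, a minimal example of applying that advice in code (the 40g
value is only an illustration of "leave some headroom", not a recommendation
for this particular workload):

// set before the SparkContext is created, or put the equivalent lines in conf/spark-defaults.conf
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.memory", "40g")          // smaller heap, leaving room for off-heap usage
  .set("spark.storage.memoryFraction", "0.3")   // shrink the cache's share of that heap
val sc = new org.apache.spark.SparkContext(conf)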






On Sun, Aug 2, 2015 at 12:54 PM Sea <261810...@qq.com> wrote:

> spark.storage.memoryFraction is in heap memory, but my situation is that
> the memory is more than heap memory !
>
> Anyone else use spark 1.4.1 in production?
>
>
> -- Original Message --
> *From:* "Ted Yu";;
> *Sent:* Aug 2, 2015 (Sunday), 5:45 PM
> *To:* "Sea"<261810...@qq.com>;
> *Cc:* "Barak Gitsis"; "user@spark.apache.org"<
> user@spark.apache.org>; "rxin"; "joshrosen"<
> joshro...@databricks.com>; "davies";
> *Subject:* Re: About memory leak in spark 1.4.1
>
> http://spark.apache.org/docs/latest/tuning.html does mention 
> spark.storage.memoryFraction
> in two places.
> One is under Cache Size Tuning section.
>
> FYI
>
> On Sun, Aug 2, 2015 at 2:16 AM, Sea <261810...@qq.com> wrote:
>
>> Hi, Barak
>> It is ok with spark 1.3.0, the problem is with spark 1.4.1.
>> I don't think spark.storage.memoryFraction will make any sense,
>> because it is still in heap memory.
>>
>>
>> -- Original Message --
>> *From:* "Barak Gitsis";;
>> *Sent:* Aug 2, 2015 (Sunday), 4:11 PM
>> *To:* "Sea"<261810...@qq.com>; "user";
>> *Cc:* "rxin"; "joshrosen";
>> "davies";
>> *Subject:* Re: About memory leak in spark 1.4.1
>>
>> Hi,
>> reducing spark.storage.memoryFraction did the trick for me. Heap doesn't
>> get filled because it is reserved..
>> My reasoning is:
>> I give executor all the memory i can give it, so that makes it a boundary
>> .
>> From here i try to make the best use of memory I can.
>> storage.memoryFraction is in a sense user data space.  The rest can be used
>> by the system.
>> If you don't have so much data that you MUST store in memory for
>> performance, better give spark more space..
>> ended up setting it to 0.3
>>
>> All that said, it is on spark 1.3 on cluster
>>
>> hope that helps
>>
>> On Sat, Aug 1, 2015 at 5:43 PM Sea <261810...@qq.com> wrote:
>>
>>> Hi, all
>>> I upgraded spark to 1.4.1, and many applications failed... I find the heap
>>> memory is not full, but the CoarseGrainedExecutorBackend process takes more
>>> memory than I expect, and it increases as time goes on, finally exceeding
>>> the max limit of the server, and then the worker dies.
>>>
>>> Can anyone help?
>>>
>>> Mode:standalone
>>>
>>> spark.executor.memory 50g
>>>
>>> 25583 xiaoju20   0 75.5g  55g  28m S 1729.3 88.1   2172:52 java
>>>
>>> 55g is more than the 50g I requested
>>>
>>> --
>> *-Barak*
>>
>
> --
*-Barak*


Re: About memory leak in spark 1.4.1

2015-08-02 Thread Sea
spark.storage.memoryFraction is in heap memory, but my situation is that the 
memory is more than heap memory !  


Anyone else use spark 1.4.1 in production? 




-- Original Message --
From: "Ted Yu";
Sent: Aug 2, 2015 (Sunday), 5:45 PM
To: "Sea"<261810...@qq.com>;
Cc: "Barak Gitsis"; "user@spark.apache.org"; "rxin"; "joshrosen"; "davies";
Subject: Re: About memory leak in spark 1.4.1



http://spark.apache.org/docs/latest/tuning.html does mention 
spark.storage.memoryFraction in two places.
One is under Cache Size Tuning section.


FYI


On Sun, Aug 2, 2015 at 2:16 AM, Sea <261810...@qq.com> wrote:
Hi, Barak
It is ok with spark 1.3.0, the problem is with spark 1.4.1.
I don't think spark.storage.memoryFraction will make any sense, because it 
is still in heap memory. 




-- Original Message --
From: "Barak Gitsis";
Sent: Aug 2, 2015 (Sunday), 4:11 PM
To: "Sea"<261810...@qq.com>; "user";
Cc: "rxin"; "joshrosen"; "davies";
Subject: Re: About memory leak in spark 1.4.1



Hi,
reducing spark.storage.memoryFraction did the trick for me. Heap doesn't get
filled because it is reserved.
My reasoning is: 
I give executor all the memory i can give it, so that makes it a boundary.
From here i try to make the best use of memory I can. storage.memoryFraction is 
in a sense user data space.  The rest can be used by the system. 
If you don't have so much data that you MUST store in memory for performance, 
better give spark more space.. 
ended up setting it to 0.3


All that said, it is on spark 1.3 on cluster


hope that helps


On Sat, Aug 1, 2015 at 5:43 PM Sea <261810...@qq.com> wrote:

Hi, all
I upgraded spark to 1.4.1, and many applications failed... I find the heap memory is
not full, but the CoarseGrainedExecutorBackend process takes more memory than I
expect, and it increases as time goes on, finally exceeding the max limit of the
server, and then the worker dies.


Can anyone help?


Mode: standalone


spark.executor.memory 50g


25583 xiaoju20   0 75.5g  55g  28m S 1729.3 88.1   2172:52 java


55g is more than the 50g I requested



-- 

-Barak

Re: Encryption on RDDs or in-memory/cache on Apache Spark

2015-08-02 Thread Jörn Franke
I think your use case can already be implemented with HDFS encryption and/or
SealedObject, if you are looking for something like Altibase.

If you create a JIRA you may want to set the bar a little bit higher and
propose something like MIT CryptDB: https://css.csail.mit.edu/cryptdb/
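
For the SealedObject route, here is a rough spark-shell sketch (my own
illustration, not an existing Spark feature; the hard-coded key is for
demonstration only) of encrypting values before caching and decrypting them on
access:

import javax.crypto.{Cipher, SealedObject}
import javax.crypto.spec.SecretKeySpec

val keyBytes = Array.fill(16)(7.toByte)   // 128-bit AES key; use a real key store in practice

val plain = sc.parallelize(Seq("secret-1", "secret-2"))

// build the (non-serializable) Cipher on the executors, one per partition
val sealedRdd = plain.mapPartitions { values =>
  val cipher = Cipher.getInstance("AES")
  cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(keyBytes, "AES"))
  values.map(v => new SealedObject(v, cipher))
}.cache()

// decrypt only when the data is actually needed
val decrypted = sealedRdd.mapPartitions { objs =>
  val key = new SecretKeySpec(keyBytes, "AES")
  objs.map(_.getObject(key).asInstanceOf[String])
}.collect()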

On Fri, Jul 31, 2015 at 10:17, Matthew O'Reilly  wrote:

> Hi,
>
> I am currently working on the latest version of Apache Spark (1.4.1),
> pre-built package for Hadoop 2.6+.
>
> Is there any feature in Spark/Hadoop to encrypt RDDs or in-memory/cache
> (something similar is Altibase's HDB:
> http://altibase.com/in-memory-database-computing-solutions/security/)
> when running applications in Spark? Or is there an external
> library/framework which could be used to encrypt RDDs or in-memory/cache in
> Spark?
>
> I discovered it is possible to encrypt the data, and encapsulate it into
> RDD. However, I feel this affects Spark's fast data processing as it is
> slower to encrypt the data, and then encapsulate it to RDD; it's then a two
> step process. Encryption and storing data should be done in parallel.
>
> Any help would be appreciated.
>
> Many thanks,
> Matthew
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: About memory leak in spark 1.4.1

2015-08-02 Thread Ted Yu
http://spark.apache.org/docs/latest/tuning.html does mention
spark.storage.memoryFraction
in two places.
One is under Cache Size Tuning section.

FYI

On Sun, Aug 2, 2015 at 2:16 AM, Sea <261810...@qq.com> wrote:

> Hi, Barak
> It is ok with spark 1.3.0, the problem is with spark 1.4.1.
> I don't think spark.storage.memoryFraction will make any sense,
> because it is still in heap memory.
>
>
> -- Original Message --
> *From:* "Barak Gitsis";;
> *Sent:* Aug 2, 2015 (Sunday), 4:11 PM
> *To:* "Sea"<261810...@qq.com>; "user";
> *Cc:* "rxin"; "joshrosen";
> "davies";
> *Subject:* Re: About memory leak in spark 1.4.1
>
> Hi,
> reducing spark.storage.memoryFraction did the trick for me. Heap doesn't
> get filled because it is reserved..
> My reasoning is:
> I give executor all the memory i can give it, so that makes it a boundary.
> From here i try to make the best use of memory I can.
> storage.memoryFraction is in a sense user data space.  The rest can be used
> by the system.
> If you don't have so much data that you MUST store in memory for
> performance, better give spark more space..
> ended up setting it to 0.3
>
> All that said, it is on spark 1.3 on cluster
>
> hope that helps
>
> On Sat, Aug 1, 2015 at 5:43 PM Sea <261810...@qq.com> wrote:
>
>> Hi, all
>> I upgraded spark to 1.4.1, and many applications failed... I find the heap
>> memory is not full, but the CoarseGrainedExecutorBackend process takes more
>> memory than I expect, and it increases as time goes on, finally exceeding
>> the max limit of the server, and then the worker dies.
>>
>> Can anyone help?
>>
>> Mode:standalone
>>
>> spark.executor.memory 50g
>>
>> 25583 xiaoju20   0 75.5g  55g  28m S 1729.3 88.1   2172:52 java
>>
>> 55g is more than the 50g I requested
>>
>> --
> *-Barak*
>


Re: About memory leak in spark 1.4.1

2015-08-02 Thread Sea
Hi, Barak
It is ok with spark 1.3.0, the problem is with spark 1.4.1.
I don't think spark.storage.memoryFraction will make any sense, because it 
is still in heap memory. 




-- Original Message --
From: "Barak Gitsis";
Sent: Aug 2, 2015 (Sunday), 4:11 PM
To: "Sea"<261810...@qq.com>; "user";
Cc: "rxin"; "joshrosen"; "davies";
Subject: Re: About memory leak in spark 1.4.1



Hi,
reducing spark.storage.memoryFraction did the trick for me. Heap doesn't get
filled because it is reserved.
My reasoning is: 
I give executor all the memory i can give it, so that makes it a boundary.
From here i try to make the best use of memory I can. storage.memoryFraction is 
in a sense user data space.  The rest can be used by the system. 
If you don't have so much data that you MUST store in memory for performance, 
better give spark more space.. 
ended up setting it to 0.3


All that said, it is on spark 1.3 on cluster


hope that helps


On Sat, Aug 1, 2015 at 5:43 PM Sea <261810...@qq.com> wrote:

Hi, all
I upgraded spark to 1.4.1, and many applications failed... I find the heap memory is
not full, but the CoarseGrainedExecutorBackend process takes more memory than I
expect, and it increases as time goes on, finally exceeding the max limit of the
server, and then the worker dies.


Can anyone help?


Mode: standalone


spark.executor.memory 50g


25583 xiaoju20   0 75.5g  55g  28m S 1729.3 88.1   2172:52 java


55g is more than the 50g I requested



-- 

-Barak

Re: Encryption on RDDs or in-memory/cache on Apache Spark

2015-08-02 Thread Akhil Das
Currently RDDs are not encrypted. I think you can go ahead and open a JIRA
to add this feature, and maybe it could be added in a future release.

Thanks
Best Regards

On Fri, Jul 31, 2015 at 1:47 PM, Matthew O'Reilly 
wrote:

> Hi,
>
> I am currently working on the latest version of Apache Spark (1.4.1),
> pre-built package for Hadoop 2.6+.
>
> Is there any feature in Spark/Hadoop to encrypt RDDs or in-memory/cache
> (something similar is Altibase's HDB:
> http://altibase.com/in-memory-database-computing-solutions/security/)
> when running applications in Spark? Or is there an external
> library/framework which could be used to encrypt RDDs or in-memory/cache in
> Spark?
>
> I discovered it is possible to encrypt the data, and encapsulate it into
> RDD. However, I feel this affects Spark's fast data processing as it is
> slower to encrypt the data, and then encapsulate it to RDD; it's then a two
> step process. Encryption and storing data should be done in parallel.
>
> Any help would be appreciated.
>
> Many thanks,
> Matthew
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: About memory leak in spark 1.4.1

2015-08-02 Thread Barak Gitsis
Hi,
reducing spark.storage.memoryFraction did the trick for me. Heap doesn't
get filled because it is reserved..
My reasoning is:
I give executor all the memory i can give it, so that makes it a boundary.
>From here i try to make the best use of memory I can.
storage.memoryFraction is in a sense user data space.  The rest can be used
by the system.
If you don't have so much data that you MUST store in memory for
performance, better give spark more space..
ended up setting it to 0.3

All that said, it is on spark 1.3 on cluster

hope that helps

On Sat, Aug 1, 2015 at 5:43 PM Sea <261810...@qq.com> wrote:

> Hi, all
> I upgraded spark to 1.4.1, and many applications failed... I find the heap
> memory is not full, but the CoarseGrainedExecutorBackend process takes more
> memory than I expect, and it increases as time goes on, finally exceeding
> the max limit of the server, and then the worker dies.
>
> Can anyone help?
>
> Mode:standalone
>
> spark.executor.memory 50g
>
> 25583 xiaoju20   0 75.5g  55g  28m S 1729.3 88.1   2172:52 java
>
> 55g is more than the 50g I requested
>
> --
*-Barak*


Re: Does Spark Streaming need to list all the files in a directory?

2015-08-02 Thread Akhil Das
I guess it goes through those 500k files for the first time and then uses a
filter from the next time onwards.
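
For completeness, a sketch of the variant that exposes this explicitly
(assuming Hadoop's TextInputFormat and a live StreamingContext named ssc; the
directory is a placeholder): fileStream takes a path filter and a newFilesOnly
flag, whereas textFileStream just uses the defaults.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "hdfs:///incoming",                                // placeholder directory
  (path: Path) => !path.getName.startsWith("."),     // skip hidden/temporary files
  newFilesOnly = true                                // ignore files already present at startup
).map(_._2.toString)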

Thanks
Best Regards

On Fri, Jul 31, 2015 at 4:39 AM, Tathagata Das  wrote:

> For the first time it needs to list them. After that the list should be
> cached by the file stream implementation (as far as I remember).
>
>
> On Thu, Jul 30, 2015 at 3:55 PM, Brandon White 
> wrote:
>
>> Is this a known bottleneck for Spark Streaming textFileStream? Does it
>> need to list all the current files in a directory before it gets the new
>> files? Say I have 500k files in a directory, does it list them all in order
>> to get the new files?
>>
>
>


Re: unsubscribe

2015-08-02 Thread Akhil Das
​LOL Brandon!

@ziqiu See http://spark.apache.org/community.html

You need to send an email to user-unsubscr...@spark.apache.org​

Thanks
Best Regards

On Fri, Jul 31, 2015 at 2:06 AM, Brandon White 
wrote:

> https://www.youtube.com/watch?v=JncgoPKklVE
>
> On Thu, Jul 30, 2015 at 1:30 PM,  wrote:
>
>>
>>
>> --
>>
>> This message is for the designated recipient only and may contain
>> privileged, proprietary, or otherwise confidential information. If you have
>> received it in error, please notify the sender immediately and delete the
>> original. Any other use of the e-mail by you is prohibited. Where allowed
>> by local law, electronic communications with Accenture and its affiliates,
>> including e-mail and instant messaging (including content), may be scanned
>> by our systems for the purposes of information security and assessment of
>> internal compliance with Accenture policy.
>>
>> __
>>
>> www.accenture.com
>>
>
>