In Spark-scala, how to fill Vectors.dense in DataFrame from CSV?

2016-09-22 Thread Dan Bikle
hello spark-world,

I am new to spark.

I noticed this online example:

http://spark.apache.org/docs/latest/ml-pipeline.html

I am curious about this syntax:

// Prepare training data from a list of (label, features) tuples.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

Is it possible to replace the above call with some syntax that reads the values
from a CSV file?

I want something comparable to Python-Pandas read_csv() method.


pyspark ML example not working

2016-09-22 Thread jypucca
 
I installed Spark 2.0.0 and was trying the ML example (IndexToString) on
this web page: http://spark.apache.org/docs/latest/ml-features.html#onehotencoder,
using a Jupyter notebook (running PySpark) to create a simple dataframe, and I
keep getting a long error message (see below). PySpark has worked fine with
RDDs, but any time I try to do anything with a DataFrame it keeps throwing
error messages. Any help would be appreciated, thanks!

***
Py4JJavaError: An error occurred while calling o23.applySchemaToPythonRDD.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:171)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
at org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
at org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
at org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:63)
at org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
at org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:666)
at org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:656)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
... 32 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
... 38 more
Caused by: javax.jdo.JDOFatalDataStoreException: Unable to open a test connection to the

Re: spark stream on yarn oom

2016-09-22 Thread manasdebashiskar
It appears that the version against which your program is compiled is
different from the Spark version you are running your code against.

..Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-stream-on-yarn-oom-tp27766p27782.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: In Spark-scala, how to fill Vectors.dense in DataFrame from CSV?

2016-09-22 Thread Kevin Mellott
You'll want to use the spark-csv package, which is included in Spark 2.0.
The repository documentation has some great usage examples.

https://github.com/databricks/spark-csv

Thanks,
Kevin
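
For reference, a minimal sketch of the Spark 2.0 route, assuming a header-less CSV whose first column is the label and whose remaining three columns are the feature values (the file path and column names below are placeholders):

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-to-features").getOrCreate()

// Read the CSV; inferSchema turns the numeric columns into doubles.
val raw = spark.read
  .option("header", "false")
  .option("inferSchema", "true")
  .csv("/path/to/training.csv")
  .toDF("label", "f1", "f2", "f3")

// Pack the feature columns into a single vector column, as the ML API expects.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3"))
  .setOutputCol("features")

val training = assembler.transform(raw).select("label", "features")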

On Thu, Sep 22, 2016 at 8:40 PM, Dan Bikle  wrote:

> hello spark-world,
>
> I am new to spark.
>
> I noticed this online example:
>
> http://spark.apache.org/docs/latest/ml-pipeline.html
>
> I am curious about this syntax:
>
> // Prepare training data from a list of (label, features) tuples.
> val training = spark.createDataFrame(Seq(
>   (1.0, Vectors.dense(0.0, 1.1, 0.1)),
>   (0.0, Vectors.dense(2.0, 1.0, -1.0)),
>   (0.0, Vectors.dense(2.0, 1.3, 1.0)),
>   (1.0, Vectors.dense(0.0, 1.2, -0.5))
> )).toDF("label", "features")
>
> Is it possible to replace the above call to some syntax which reads values
> from CSV?
>
> I want something comparable to Python-Pandas read_csv() method.
>
>


Redshift Vs Spark SQL (Thrift)

2016-09-22 Thread ayan guha
Hi

Is there any benchmark or point of view on the pros and cons of
AWS Redshift vs. Spark SQL through STS (the Spark Thrift Server)?

-- 
Best Regards,
Ayan Guha


Re: Spark RDD and Memory

2016-09-22 Thread Aditya

Thanks for the reply.

One more question.
How does Spark handle data if it does not fit in memory? The answer I
got is that it spills the data to disk and handles the memory issue.

Plus, consider the example below.
val textFile = sc.textFile("/user/emp.txt")
val textFile1 = sc.textFile("/user/emp1.xt")
val join = textFile.join(textFile1)
join.saveAsTextFile("/home/output")
val count = join.count()

When the first action is performed, it loads textFile and textFile1 into
memory, performs the join and saves the result.
But when the second action (count) is called, does it load textFile and
textFile1 into memory again and perform the join operation again?
If it does load them again, what is the correct way to prevent it from
reloading the same data?
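
For reference, a minimal sketch of one way to prevent the reload: persist the joined RDD before the first action and unpersist it when finished (keying by the first comma-separated field is only an assumption to make the join well-typed):

import org.apache.spark.storage.StorageLevel

// join() needs key/value pairs, so key each line first.
val textFile  = sc.textFile("/user/emp.txt").keyBy(_.split(",")(0))
val textFile1 = sc.textFile("/user/emp1.xt").keyBy(_.split(",")(0))

// Cache the joined RDD so the second action reuses it instead of
// re-reading the files and re-running the join.
val join = textFile.join(textFile1).persist(StorageLevel.MEMORY_ONLY)

join.saveAsTextFile("/home/output")   // first action: computes and caches the join
val count = join.count()              // second action: served from the cache

join.unpersist()                      // free the storage memory when finished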


On Thursday 22 September 2016 11:12 PM, Mich Talebzadeh wrote:

Hi,

unpersist works on storage memory not execution memory. So I do not 
think you can flush it out of memory if you have not cached it using 
cache or something like below in the first place.


s.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)

s.unpersist

I believe the recent versions of Spark deploy Least Recently Used 
(LRU) mechanism to flush unused data out of memory much like RBMS 
cache management. I know LLDAP does that.


HTH



Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw


http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for 
any loss, damage or destruction of data or any other property which 
may arise from relying on this email's technical content is explicitly 
disclaimed. The author will in no case be liable for any monetary 
damages arising from such loss, damage or destruction.



On 22 September 2016 at 18:09, Hanumath Rao Maduri > wrote:


Hello Aditya,

After an intermediate action has been applied you might want to
call rdd.unpersist() to let spark know that this rdd is no longer
required.

Thanks,
-Hanu

On Thu, Sep 22, 2016 at 7:54 AM, Aditya
> wrote:

Hi,

Suppose I have two RDDs
val textFile = sc.textFile("/user/emp.txt")
val textFile1 = sc.textFile("/user/emp1.xt")

Later I perform a join operation on above two RDDs
val join = textFile.join(textFile1)

And there are subsequent transformations without including
textFile and textFile1 further and an action to start the
execution.

When action is called, textFile and textFile1 will be loaded
in memory first. Later join will be performed and kept in memory.
My question is once join is there memory and is used for
subsequent execution, what happens to textFile and textFile1
RDDs. Are they still kept in memory untill the full lineage
graph is completed or is it destroyed once its use is over? If
it is kept in memory, is there any way I can explicitly remove
it from memory to free the memory?





-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org










Fwd: Error while Spark 1.6.1 streaming from Kafka-2.11_0.10.0.1 cluster

2016-09-22 Thread sagarcasual .
Also, you mentioned the streaming-kafka-0-10 connector. What connector is
this, and do you know the dependency? I did not see it mentioned in the
documents.
For current Spark 1.6.1 to Kafka 0.10.0.1 standalone, the only dependencies
I have are

org.apache.spark:spark-core_2.10:1.6.1
compile group: 'org.apache.spark', name: 'spark-streaming_2.10', version:'1.6.1'
compile group: 'org.apache.spark', name: 'spark-streaming-kafka_2.10',
version:'1.6.1'
compile group: 'org.apache.spark', name: 'spark-sql_2.10', version: '1.6.1'

For Spark 2.0 with Kafka 0.10.0.1 do I need to have a different kafka
connector dependency?
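
For reference, with Spark 2.0 the Kafka 0.10 integration is published as a separate artifact, spark-streaming-kafka-0-10. A sketch in the same Gradle style as above, assuming the Scala 2.11 build of Spark 2.0.0:

compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.0.0'
compile group: 'org.apache.spark', name: 'spark-streaming_2.11', version: '2.0.0'
compile group: 'org.apache.spark', name: 'spark-streaming-kafka-0-10_2.11', version: '2.0.0'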


On Thu, Sep 22, 2016 at 2:21 PM, sagarcasual . 
wrote:

> Hi Cody,
> Thanks for the response.
> One thing I forgot to mention is I am using a Direct Approach (No
> receivers) in Spark streaming.
>
> I am not sure if I have that leverage to upgrade at this point, but do you
> know if the Spark 1.6.1 to Spark 2.0 jump is usually smooth or does it involve
> a lot of hiccups.
> Also is there a migration guide or something?
>
> -Regards
> Sagar
>
> On Thu, Sep 22, 2016 at 1:39 PM, Cody Koeninger 
> wrote:
>
>> Do you have the ability to try using Spark 2.0 with the
>> streaming-kafka-0-10 connector?
>>
>> I'd expect the 1.6.1 version to be compatible with kafka 0.10, but it
>> would be good to rule that out.
>>
>> On Thu, Sep 22, 2016 at 1:37 PM, sagarcasual . 
>> wrote:
>> > Hello,
>> >
>> > I am trying to stream data out of kafka cluster (2.11_0.10.0.1) using
>> Spark
>> > 1.6.1
>> > I am receiving following error, and I confirmed that Topic to which I am
>> > trying to connect exists with the data .
>> >
>> > Any idea what could be the case?
>> >
>> > kafka.common.UnknownTopicOrPartitionException
>> > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>> > at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>> > at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>> > at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>> > at java.lang.Class.newInstance(Class.java:442)
>> > at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:102)
>> > at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.handleFetchErr(KafkaRDD.scala:184)
>> > at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:193)
>> > at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:208)
>> > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>> > at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:29)
>> >
>> >
>>
>
>


Re: Is executor computing time affected by network latency?

2016-09-22 Thread Peter Figliozzi
It seems to me they must communicate for joins, sorts, grouping, and so
forth, where the original data partitioning needs to change.  You could
repeat your experiment for different code snippets.  I'll bet it depends on
what you do.

On Thu, Sep 22, 2016 at 8:54 AM, gusiri  wrote:

> Hi,
>
> When I increase the network latency among spark nodes,
>
> I see compute time (=executor computing time in Spark Web UI) also
> increases.
>
> In the graph attached, left = latency 1ms vs right = latency 500ms.
>
> Is there any communication between worker and driver/master even 'during'
> executor computing? or any idea on this result?
>
>
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27779/Screen_Shot_2016-09-21_at_5.png>
>
>
>
>
>
> Thank you very much in advance.
>
> //gusiri
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Is-executor-computing-time-
> affected-by-network-latency-tp27779.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Is executor computing time affected by network latency?

2016-09-22 Thread Soumitra Johri
If your job involves a shuffle then the compute for the entire batch will
increase with network latency. What would be interesting is to see how much
time each task/job/stage takes.
On Thu, Sep 22, 2016 at 5:11 PM Peter Figliozzi 
wrote:

> It seems to me they must communicate for joins, sorts, grouping, and so
> forth, where the original data partitioning needs to change.  You could
> repeat your experiment for different code snippets.  I'll bet it depends on
> what you do.
>
> On Thu, Sep 22, 2016 at 8:54 AM, gusiri  wrote:
>
>> Hi,
>>
>> When I increase the network latency among spark nodes,
>>
>> I see compute time (=executor computing time in Spark Web UI) also
>> increases.
>>
>> In the graph attached, left = latency 1ms vs right = latency 500ms.
>>
>> Is there any communication between worker and driver/master even 'during'
>> executor computing? or any idea on this result?
>>
>>
>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27779/Screen_Shot_2016-09-21_at_5.png>
>>
>>
>>
>>
>>
>> Thank you very much in advance.
>>
>> //gusiri
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Is-executor-computing-time-affected-by-network-latency-tp27779.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: very high maxresults setting (no collect())

2016-09-22 Thread Adrian Bridgett

Hi Michael,

No spark upgrade, we've been changing some of our data pipelines so the 
data volumes have probably been getting a bit larger.  Just in the last 
few weeks we've seen quite a few jobs needing a larger maxResultSize. 
Some jobs have gone from "fine with 1GB default" to 3GB.   Wondering 
what besides a collect could cause this (as there's certainly not an 
explicit collect()).


Mesos, parquet source data, a broadcast of a small table earlier which 
is joined then just a few aggregations, select, coalesce and spark-csv 
write.  The executors go along nicely (as does the driver) and then we 
start to hit memory pressure on the driver in the output loop and the 
job grinds to a crawl (we eventually have to kill it and restart with 
more memory).


Adrian

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Memory usage by Spark jobs

2016-09-22 Thread Jörn Franke
You should also take into account that Spark has different options to represent
data in-memory, such as Java serialized objects, Kryo serialization, Tungsten
(columnar, optionally compressed), etc. The Tungsten representation depends heavily
on the underlying data and sorting, especially if compressed.
Then, you might also think about broadcast data etc.

As such I am not aware of a specific guide, but there is also no magic behind
it. Could be a good JIRA task :)
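
As a small illustration of the serializer option mentioned above, a sketch only (the application name is a placeholder):

import org.apache.spark.sql.SparkSession

// Switch cached/shuffled data to Kryo serialization, one of the
// in-memory representation options listed above.
val spark = SparkSession.builder()
  .appName("tpch-profiling")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()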

> On 22 Sep 2016, at 08:36, Hemant Bhanawat  wrote:
> 
> I am working on profiling TPCH queries for Spark 2.0.  I see lot of temporary 
> object creation (sometimes size as much as the data size) which is justified 
> for the kind of processing Spark does. But, from production perspective, is 
> there a guideline on how much memory should be allocated for processing a 
> specific data size of let's say parquet data? Also, has someone investigated 
> memory usage for the individual SQL operators like Filter, group by, order 
> by, Exchange etc.? 
> 
> Hemant Bhanawat
> www.snappydata.io 


Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-22 Thread Olivier Girardot

Hi,
When we fetch Spark 2.0.0 as a Maven dependency we automatically end up
with Hadoop 2.2 as a transitive dependency. I know multiple profiles are used to
generate the different tar.gz bundles that we can download. Is there by any
chance a publication of Spark 2.0.0 with different classifiers according to the
different versions of Hadoop available?
Thanks for your time!
Olivier Girardot

Memory usage by Spark jobs

2016-09-22 Thread Hemant Bhanawat
I am working on profiling TPCH queries for Spark 2.0. I see a lot of
temporary object creation (sometimes as much as the data size itself), which
is justified for the kind of processing Spark does. But, from a production
perspective, is there a guideline on how much memory should be allocated for
processing a specific data size of, let's say, parquet data? Also, has
someone investigated memory usage for the individual SQL operators like
Filter, group by, order by, Exchange etc.?

Hemant Bhanawat 
www.snappydata.io


Re: Apache Spark JavaRDD pipe() need help

2016-09-22 Thread शशिकांत कुलकर्णी
Hello Jakob,

Thanks for replying. Here is a short example of what I am trying. Taking an
example of Product column family in Cassandra just for explaining my
requirement

In Driver.java
{
 JavaRDD<Product> productsRdd = Get Products from Cassandra;
 productsRdd.map(ProductHelper.processProduct());
}

in ProductHelper.java
{

public static Function<Product, Boolean> processProduct() {
return new Function< Product, Boolean>(){
private static final long serialVersionUID = 1L;

@Override
public Boolean call(Product product) throws Exception {
//STEP 1: Doing some processing on product object.
//STEP 2: Now using few values of product, I need to create a string like
"name id sku datetime"
//STEP 3: Pass this string to my C binary file to perform some complex
calculations and return some data
//STEP 4: Get the return data and store it back in Cassandra DB
}
};
}
}

In this ProductHelper, I cannot pass and don't want to pass sparkContext
object as app will throw error of "task not serializable". If there is a
way let me know.

Now I am not able to achieve STEP 3 above. How can I pass a String to C
binary and get the output back in my program. The C binary reads data from
STDIN and outputs data to STDOUT. It is working from other part of
application from PHP. I want to reuse the same C binary in my Apache SPARK
application for some background processing and analysis using
JavaRDD.pipe() API. If there is any other way let me know. This code will
be executed in all the nodes in a cluster.

Hope my requirement is now clear. How to do this?

Regards,
Shash

On Thu, Sep 22, 2016 at 4:13 AM, Jakob Odersky  wrote:

> Can you provide more details? It's unclear what you're asking
>
> On Wed, Sep 21, 2016 at 10:14 AM, shashikant.kulka...@gmail.com
>  wrote:
> > Hi All,
> >
> > I am trying to use the JavaRDD.pipe() API.
> >
> > I have one object with me from the JavaRDD
>


Open source Spark based projects

2016-09-22 Thread tahirhn
I am planning to write a thesis on certain aspects (i.e testing, performance
optimisation, security) of Apache Spark. I need to study some projects that
are based on Apache Spark and are available as open source. 

If you know any such project (open source Spark based project), Please share
it here. Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Open-source-Spark-based-projects-tp27778.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-22 Thread Sean Owen
There can be just one published version of the Spark artifacts and they
have to depend on something, though in truth they'd be binary-compatible
with anything 2.2+. So you merely manage the dependency versions up to the
desired version in your <dependencyManagement> section.
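
A minimal pom sketch of that approach (the Hadoop version shown is simply the one asked about):

<dependencyManagement>
  <dependencies>
    <!-- Pin the hadoop-client that Spark pulls in transitively -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.6.0</version>
    </dependency>
  </dependencies>
</dependencyManagement>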

On Thu, Sep 22, 2016 at 7:05 AM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> Hi,
> when we fetch Spark 2.0.0 as maven dependency then we automatically end up
> with hadoop 2.2 as a transitive dependency, I know multiple profiles are
> used to generate the different tar.gz bundles that we can download, Is
> there by any chance publications of Spark 2.0.0 with different classifier
> according to different versions of Hadoop available ?
>
> Thanks for your time !
>
> *Olivier Girardot*
>


Re: Open source Spark based projects

2016-09-22 Thread Sean Owen
https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
and maybe related ...
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

On Thu, Sep 22, 2016 at 11:15 AM, tahirhn  wrote:
> I am planning to write a thesis on certain aspects (i.e testing, performance
> optimisation, security) of Apache Spark. I need to study some projects that
> are based on Apache Spark and are available as open source.
>
> If you know any such project (open source Spark based project), Please share
> it here. Thanks
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Open-source-Spark-based-projects-tp27778.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark Application Log

2016-09-22 Thread Bedrytski Aliaksandr
Hi Divya,

Have you tried this command *yarn logs -applicationId
application_x_ *?
(where application_x_ is the id of the application and
may be found in the output of the 'spark-submit' command or in the
yarn's webui)
It will collect the logs from all the executors in one output.

Regards,
--
  Bedrytski Aliaksandr
  sp...@bedryt.ski



On Thu, Sep 22, 2016, at 06:06, Divya Gehlot wrote:
> Hi,
> I have initialised the logging in my Spark app:
>
> /* Initialize Logging */
> val log = Logger.getLogger(getClass.getName)
>
> Logger.getLogger("org").setLevel(Level.OFF)
>
> Logger.getLogger("akka").setLevel(Level.OFF)
>
> log.warn("Some text" + Somemap.size)
>
> When I run my Spark job using spark-submit as below:
>
> spark-submit \
>   --master yarn-client \
>   --driver-memory 1G \
>   --executor-memory 1G \
>   --executor-cores 1 \
>   --num-executors 2 \
>   --class MainClass \
>   /home/hadoop/Spark-assembly-1.0.jar
>
> I can see the log in the terminal itself:
> 16/09/22 03:45:31 WARN MainClass$: SomeText  : 10
>
> When I set up this job in scheduler
> where I can see these logs?
>
> Thanks,
> Divya
>


Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Jörn Franke
this is probably the best way to manage it

On Thu, Sep 22, 2016 at 6:42 PM, Josh Rosen 
wrote:

> Spark SQL / Tungsten's explicitly-managed off-heap memory will be capped
> at spark.memory.offHeap.size bytes. This is purposely specified as an
> absolute size rather than as a percentage of the heap size in order to
> allow end users to tune Spark so that its overall memory consumption stays
> within container memory limits.
>
> To use your example of a 3GB YARN container, you could configure Spark so
> that it's maximum heap size plus spark.memory.offHeap.size is smaller than
> 3GB (minus some overhead fudge-factor).
>
> On Thu, Sep 22, 2016 at 7:56 AM Sean Owen  wrote:
>
>> It's looking at the whole process's memory usage, and doesn't care
>> whether the memory is used by the heap or not within the JVM. Of
>> course, allocating memory off-heap still counts against you at the OS
>> level.
>>
>> On Thu, Sep 22, 2016 at 3:54 PM, Michael Segel
>>  wrote:
>> > Thanks for the response Sean.
>> >
>> > But how does YARN know about the off-heap memory usage?
>> > That’s the piece that I’m missing.
>> >
>> > Thx again,
>> >
>> > -Mike
>> >
>> >> On Sep 21, 2016, at 10:09 PM, Sean Owen  wrote:
>> >>
>> >> No, Xmx only controls the maximum size of on-heap allocated memory.
>> >> The JVM doesn't manage/limit off-heap (how could it? it doesn't know
>> >> when it can be released).
>> >>
>> >> The answer is that YARN will kill the process because it's using more
>> >> memory than it asked for. A JVM is always going to use a little
>> >> off-heap memory by itself, so setting a max heap size of 2GB means the
>> >> JVM process may use a bit more than 2GB of memory. With an off-heap
>> >> intensive app like Spark it can be a lot more.
>> >>
>> >> There's a built-in 10% overhead, so that if you ask for a 3GB executor
>> >> it will ask for 3.3GB from YARN. You can increase the overhead.
>> >>
>> >> On Wed, Sep 21, 2016 at 11:41 PM, Jörn Franke 
>> wrote:
>> >>> All off-heap memory is still managed by the JVM process. If you limit
>> the
>> >>> memory of this process then you limit the memory. I think the memory
>> of the
>> >>> JVM process could be limited via the xms/xmx parameter of the JVM.
>> This can
>> >>> be configured via spark options for yarn (be aware that they are
>> different
>> >>> in cluster and client mode), but i recommend to use the spark options
>> for
>> >>> the off heap maximum.
>> >>>
>> >>> https://spark.apache.org/docs/latest/running-on-yarn.html
>> >>>
>> >>>
>> >>> On 21 Sep 2016, at 22:02, Michael Segel 
>> wrote:
>> >>>
>> >>> I’ve asked this question a couple of times from a friend who
>> didn’t know
>> >>> the answer… so I thought I would try here.
>> >>>
>> >>>
>> >>> Suppose we launch a job on a cluster (YARN) and we have set up the
>> >>> containers to be 3GB in size.
>> >>>
>> >>>
>> >>> What does that 3GB represent?
>> >>>
>> >>> I mean what happens if we end up using 2-3GB of off heap storage via
>> >>> tungsten?
>> >>> What will Spark do?
>> >>> Will it try to honor the container’s limits and throw an exception
>> or will
>> >>> it allow my job to grab that amount of memory and exceed YARN’s
>> >>> expectations since its off heap?
>> >>>
>> >>> Thx
>> >>>
>> >>> -Mike
>> >>>
>> >
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Apache Spark JavaRDD pipe() need help

2016-09-22 Thread Jakob Odersky
Hi Shashikant,

I think you are trying to do too much at once in your helper class.
Spark's RDD API is functional, it is meant to be used by writing many
little transformations that will be distributed across a cluster.

Appart from that, `rdd.pipe` seems like a good approach. Here is the
relevant doc comment (in RDD.scala) on how to use it:

 Return an RDD created by piping elements to a forked external
process. The resulting RDD
   * is computed by executing the given process once per partition. All elements
   * of each input partition are written to a process's stdin as lines
of input separated
   * by a newline. The resulting partition consists of the process's
stdout output, with
   * each line of stdout resulting in one element of the output
partition. A process is invoked
   * even for empty partitions.
   *
   * [...]
Check the full docs here
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD@pipe(command:String):org.apache.spark.rdd.RDD[String]

This is how you could use it:

productRDD=//get from cassandra
processedRDD=productsRDD.map(STEP1).map(STEP2).pipe(C binary of step 3)
STEP4 //store processed RDD
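
A slightly more concrete Scala sketch of the same pattern (the binary path, the Product fields and the final write are all placeholders, not from the thread):

case class Product(name: String, id: String, sku: String, datetime: String)

def process(productsRdd: org.apache.spark.rdd.RDD[Product]): Unit = {
  // STEP 1/2: build the "name id sku datetime" line the binary expects
  val inputLines = productsRdd.map(p => s"${p.name} ${p.id} ${p.sku} ${p.datetime}")

  // STEP 3: pipe each partition through the external C binary; every stdout
  // line of the binary becomes one element of the resulting RDD
  val results = inputLines.pipe("/opt/bin/product-calc")

  // STEP 4: persist the results, e.g. via foreachPartition and your Cassandra client
  results.foreachPartition(lines => lines.foreach(println))
}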

hope this gives you some pointers,

best,
--Jakob




On Thu, Sep 22, 2016 at 2:10 AM, Shashikant Kulkarni (शशिकांत
कुलकर्णी)  wrote:
> Hello Jakob,
>
> Thanks for replying. Here is a short example of what I am trying. Taking an
> example of Product column family in Cassandra just for explaining my
> requirement
>
> In Driver.java
> {
>  JavaRDD productsRdd = Get Products from Cassandra;
>  productsRdd.map(ProductHelper.processProduct());
> }
>
> in ProductHelper.java
> {
>
> public static Function processProduct() {
> return new Function< Product, Boolean>(){
> private static final long serialVersionUID = 1L;
>
> @Override
> public Boolean call(Product product) throws Exception {
> //STEP 1: Doing some processing on product object.
> //STEP 2: Now using few values of product, I need to create a string like
> "name id sku datetime"
> //STEP 3: Pass this string to my C binary file to perform some complex
> calculations and return some data
> //STEP 4: Get the return data and store it back in Cassandra DB
> }
> };
> }
> }
>
> In this ProductHelper, I cannot pass and don't want to pass sparkContext
> object as app will throw error of "task not serializable". If there is a way
> let me know.
>
> Now I am not able to achieve STEP 3 above. How can I pass a String to C
> binary and get the output back in my program. The C binary reads data from
> STDIN and outputs data to STDOUT. It is working from other part of
> application from PHP. I want to reuse the same C binary in my Apache SPARK
> application for some background processing and analysis using JavaRDD.pipe()
> API. If there is any other way let me know. This code will be executed in
> all the nodes in a cluster.
>
> Hope my requirement is now clear. How to do this?
>
> Regards,
> Shash
>
> On Thu, Sep 22, 2016 at 4:13 AM, Jakob Odersky  wrote:
>>
>> Can you provide more details? It's unclear what you're asking
>>
>> On Wed, Sep 21, 2016 at 10:14 AM, shashikant.kulka...@gmail.com
>>  wrote:
>> > Hi All,
>> >
>> > I am trying to use the JavaRDD.pipe() API.
>> >
>> > I have one object with me from the JavaRDD
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Sean Owen
I don't think I'd enable swap on a cluster. You'd rather processes
fail than grind everything to a halt. You'd buy more memory or
optimize memory before trading it for I/O.

On Thu, Sep 22, 2016 at 6:29 PM, Michael Segel
 wrote:
> Ok… gotcha… wasn’t sure that YARN just looked at the heap size allocation and 
> ignored the off heap.
>
> WRT over all OS memory… this would be one reason why I’d keep a decent amount 
> of swap around. (Maybe even putting it on a fast device like an .m2 or PCIe 
> flash drive….

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark RDD and Memory

2016-09-22 Thread Mich Talebzadeh
Hi,

unpersist works on storage memory not execution memory. So I do not think
you can flush it out of memory if you have not cached it using cache or
something like below in the first place.

s.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)

s.unpersist

I believe the recent versions of Spark deploy Least Recently Used
(LRU) mechanism to flush unused data out of memory much like RBMS cache
management. I know LLDAP does that.

HTH



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 22 September 2016 at 18:09, Hanumath Rao Maduri 
wrote:

> Hello Aditya,
>
> After an intermediate action has been applied you might want to call
> rdd.unpersist() to let spark know that this rdd is no longer required.
>
> Thanks,
> -Hanu
>
> On Thu, Sep 22, 2016 at 7:54 AM, Aditya  co.in> wrote:
>
>> Hi,
>>
>> Suppose I have two RDDs
>> val textFile = sc.textFile("/user/emp.txt")
>> val textFile1 = sc.textFile("/user/emp1.xt")
>>
>> Later I perform a join operation on above two RDDs
>> val join = textFile.join(textFile1)
>>
>> And there are subsequent transformations without including textFile and
>> textFile1 further and an action to start the execution.
>>
>> When action is called, textFile and textFile1 will be loaded in memory
>> first. Later join will be performed and kept in memory.
>> My question is once join is there memory and is used for subsequent
>> execution, what happens to textFile and textFile1 RDDs. Are they still kept
>> in memory untill the full lineage graph is completed or is it destroyed
>> once its use is over? If it is kept in memory, is there any way I can
>> explicitly remove it from memory to free the memory?
>>
>>
>>
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Jörn Franke
Well, off-heap memory will, from an OS perspective, be visible under the
JVM process (you see the memory consumption of the JVM process growing when
using off-heap memory). There is one exception: if there is another
process, which has not been started by the JVM and "lives" outside the JVM,
but uses IPC to communicate with the JVM. I do not assume this is the case
for Spark.

@xms/xmx you are right here, this is just about heap memory. You may be
able to limit the memory of the JVM process (and thus, under the previously
described assumption, the off-heap memory as well) by using cgroups, which
needs to be thought about carefully if this should be done.

On Thu, Sep 22, 2016 at 5:09 AM, Sean Owen  wrote:

> No, Xmx only controls the maximum size of on-heap allocated memory.
> The JVM doesn't manage/limit off-heap (how could it? it doesn't know
> when it can be released).
>
> The answer is that YARN will kill the process because it's using more
> memory than it asked for. A JVM is always going to use a little
> off-heap memory by itself, so setting a max heap size of 2GB means the
> JVM process may use a bit more than 2GB of memory. With an off-heap
> intensive app like Spark it can be a lot more.
>
> There's a built-in 10% overhead, so that if you ask for a 3GB executor
> it will ask for 3.3GB from YARN. You can increase the overhead.
>
> On Wed, Sep 21, 2016 at 11:41 PM, Jörn Franke 
> wrote:
> > All off-heap memory is still managed by the JVM process. If you limit the
> > memory of this process then you limit the memory. I think the memory of
> the
> > JVM process could be limited via the xms/xmx parameter of the JVM. This
> can
> > be configured via spark options for yarn (be aware that they are
> different
> > in cluster and client mode), but i recommend to use the spark options for
> > the off heap maximum.
> >
> > https://spark.apache.org/docs/latest/running-on-yarn.html
> >
> >
> > On 21 Sep 2016, at 22:02, Michael Segel 
> wrote:
> >
> > I’ve asked this question a couple of times from a friend who didn’t
> know
> > the answer… so I thought I would try here.
> >
> >
> > Suppose we launch a job on a cluster (YARN) and we have set up the
> > containers to be 3GB in size.
> >
> >
> > What does that 3GB represent?
> >
> > I mean what happens if we end up using 2-3GB of off heap storage via
> > tungsten?
> > What will Spark do?
> > Will it try to honor the container’s limits and throw an exception or
> will
> > it allow my job to grab that amount of memory and exceed YARN’s
> > expectations since its off heap?
> >
> > Thx
> >
> > -Mike
> >
>


Re: sqoop Imported and Hbase ImportTsv issue with Fled: No enum constant mapreduce.JobCounter.MB_MILLIS_MAPS

2016-09-22 Thread Mich Talebzadeh
Hi,

With the benefit of hindsight, this thread should not have been posted in
this forum but rather somewhere more appropriate, like the Sqoop user
group etc.

My motivation was that in all probability one would get a speedier
response in this active forum than by posting it somewhere else.

I trust that this explains it.

Thanks

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 22 September 2016 at 17:34, Mich Talebzadeh 
wrote:

> Hi ,
>
> I have been seeing errors at OS level when running sqoop import or hbase
> to get data into Hive and Sqoop respectively.
>
> The gist of the error is at the last line.
>
> 2016-09-22 10:49:39,472 [myid:] - INFO  [main:Job@1356] - Job
> job_1474535924802_0003 completed successfully
> 2016-09-22 10:49:39,611 [myid:] - ERROR [main:ImportTool@607] - Imported
> Failed: No enum constant org.apache.hadoop.mapreduce.Jo
> bCounter.MB_MILLIS_MAPS
>
>
> In short the first part of sqoop job  (importing RDBMS table data into
> hdfs) finishes OK, then that error comes up and sqoop stops short of
> creating and putting data in Hive table.
>
> With Hbase all finishes OK but you end up with the similar error
>
> Sounds like both sqoop and Hbase create java scripts that are run against
> MapReduce.
>
> Now sqoop compiles java with the available JDK (in my case java version
> "1.8.0_77") and the produced jar file is run against Hadoop MapReduce.
>
> Writing jar file: /tmp/sqoop-hduser/compile/b3ed391a517259ba2d2434e6f6ee35
> 42/QueryResult.jar
>
> So it suggests that there is incompatibility between Java versions between
> the one used to compile sqoop and the one used for Hadoop? As far I know
> there are the same.
>
> I was wondering how can I investigate further?
>
> I have attached the jar file
>
> thanks
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Error while Spark 1.6.1 streaming from Kafka-2.11_0.10.0.1 cluster

2016-09-22 Thread sagarcasual .
Hello,

I am trying to stream data out of kafka cluster (2.11_0.10.0.1) using Spark
1.6.1
I am receiving following error, and I confirmed that Topic to which I am
trying to connect exists with the data .

Any idea what could be the case?

kafka.common.UnknownTopicOrPartitionException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:102)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.handleFetchErr(KafkaRDD.scala:184)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:193)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:208)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:29)


Re: Has anyone installed the scala kernel for Jupyter notebook

2016-09-22 Thread andy petrella
heya, I'd say if you wanna go the spark and scala way, please yourself and
go for the Spark Notebook 
(check http://spark-notebook.io/ for pre-built distro or build your own)
hth

On Thu, Sep 22, 2016 at 12:45 AM Arif,Mubaraka 
wrote:

> we installed it but the kernel dies.
> Any clue, why ?
>
> thanks for the link :)-
>
> ~muby
>
> 
> From: Jakob Odersky [ja...@odersky.com]
> Sent: Wednesday, September 21, 2016 4:54 PM
> To: Arif,Mubaraka
> Cc: User; Toivola,Sami
> Subject: Re: Has anyone installed the scala kernel for Jupyter notebook
>
> One option would be to use Apache Toree. A quick setup guide can be
> found here
> https://urldefense.proofpoint.com/v2/url?u=https-3A__toree.incubator.apache.org_documentation_user_quick-2Dstart=CwIBaQ=RI9dKKMRNVHr9NFa7OQiQw=dUN85GiSQZVDs0gTK4x1mSiAdXTZ-7F0KzGt2fcse38=aQ-ch2WNqv83T9vSyNogXuQZ5X3hK9k6MRt7uUhtfmg=qc-mcUm9Yx0_kXIfKLy0FUmsv_pRLZyCIHI7nzLbKr0=
>
> On Wed, Sep 21, 2016 at 2:02 PM, Arif,Mubaraka 
> wrote:
> > Has anyone installed the scala kernel for Jupyter notebook.
> >
> >
> >
> > Any blogs or steps to follow in appreciated.
> >
> >
> >
> > thanks,
> >
> > Muby
> >
> > - To
> > unsubscribe e-mail: user-unsubscr...@spark.apache.org
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
andy


Re: Open source Spark based projects

2016-09-22 Thread Sonal Goyal
https://spark-packages.org/



Thanks,
Sonal
Nube Technologies 





On Thu, Sep 22, 2016 at 3:48 PM, Sean Owen  wrote:

> https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
> and maybe related ...
> https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
>
> On Thu, Sep 22, 2016 at 11:15 AM, tahirhn  wrote:
> > I am planning to write a thesis on certain aspects (i.e testing,
> performance
> > optimisation, security) of Apache Spark. I need to study some projects
> that
> > are based on Apache Spark and are available as open source.
> >
> > If you know any such project (open source Spark based project), Please
> share
> > it here. Thanks
> >
> >
> >
> > --
> > View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Open-source-Spark-based-projects-tp27778.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Hbase Connection not seraializible in Spark -> foreachrdd

2016-09-22 Thread KhajaAsmath Mohammed
Thanks Das and Ayan.

Do you have any references on how to create a connection pool for HBase inside
foreachPartition, as mentioned in the guide? In my case, I have to use a
Kerberos-secured HBase cluster.
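
For reference, a minimal Scala sketch of the foreachPartition pattern from that section of the guide (the connection-holder object and the HBaseEntityManager helper are hypothetical names; the Kerberos login would happen inside the lazily-built connection):

// Build the non-serializable connection once per executor JVM, never on the driver.
object HBaseConnectionHolder {
  lazy val entityManager = new HBaseEntityManager()   // hypothetical: performs the Kerberos login
  lazy val connection = entityManager.getConnection()
}

appEventDStream.foreachRDD { rdd =>
  rdd.foreachPartition { entities =>
    val conn = HBaseConnectionHolder.connection       // reused across batches on this executor
    entities.foreach { entity =>
      generatePut(HBaseConnectionHolder.entityManager, conn,
        entity.getClass.getSimpleName, entity.asInstanceOf[DataPoint])
    }
  }
}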

On Wed, Sep 21, 2016 at 6:39 PM, Tathagata Das 
wrote:

> http://spark.apache.org/docs/latest/streaming-programming-
> guide.html#design-patterns-for-using-foreachrdd
>
> On Wed, Sep 21, 2016 at 4:26 PM, ayan guha  wrote:
>
>> Connection object is not serialisable. You need to implement a
>> getorcreate function which would run on each executors to create hbase
>> connection locally.
>> On 22 Sep 2016 08:34, "KhajaAsmath Mohammed" 
>> wrote:
>>
>>> Hello Everyone,
>>>
>>> I am running spark application to push data from kafka. I am able to get
>>> hbase kerberos connection successfully outside of functon before calling
>>> foreachrdd on Dstream.
>>>
>>> Job fails inside foreachrdd stating that hbaseconnection object is not
>>> serialized. could you please let me now  how toresolve this.
>>>
>>> @transient val hbaseConnection=hBaseEntityManager.getConnection()
>>>
>>> appEventDStream.foreachRDD(rdd => {
>>>   if (!rdd.isEmpty()) {
>>> rdd.foreach { entity =>
>>>   {
>>>   
>>> generatePut(hBaseEntityManager,hbaseConnection,entity.getClass.getSimpleName,entity.asInstanceOf[DataPoint])
>>>
>>> }
>>>
>>> }
>>>
>>>
>>> Error is thrown exactly at connection object inside foreachRdd saying it is 
>>> not serialize. could anyone provide solution for it
>>>
>>> Asmath
>>>
>>>
>


Spark RDD and Memory

2016-09-22 Thread Aditya

Hi,

Suppose I have two RDDs
val textFile = sc.textFile("/user/emp.txt")
val textFile1 = sc.textFile("/user/emp1.xt")

Later I perform a join operation on above two RDDs
val join = textFile.join(textFile1)

And there are subsequent transformations without including textFile and 
textFile1 further and an action to start the execution.


When the action is called, textFile and textFile1 will be loaded into memory
first. Later the join will be performed and kept in memory.
My question is: once the join result is in memory and is used for subsequent
execution, what happens to the textFile and textFile1 RDDs? Are they still
kept in memory until the full lineage graph is completed, or are they destroyed
once their use is over? If they are kept in memory, is there any way I can
explicitly remove them from memory to free it?






-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Michael Segel
Thanks for the response Sean. 

But how does YARN know about the off-heap memory usage? 
That’s the piece that I’m missing.

Thx again, 

-Mike

> On Sep 21, 2016, at 10:09 PM, Sean Owen  wrote:
> 
> No, Xmx only controls the maximum size of on-heap allocated memory.
> The JVM doesn't manage/limit off-heap (how could it? it doesn't know
> when it can be released).
> 
> The answer is that YARN will kill the process because it's using more
> memory than it asked for. A JVM is always going to use a little
> off-heap memory by itself, so setting a max heap size of 2GB means the
> JVM process may use a bit more than 2GB of memory. With an off-heap
> intensive app like Spark it can be a lot more.
> 
> There's a built-in 10% overhead, so that if you ask for a 3GB executor
> it will ask for 3.3GB from YARN. You can increase the overhead.
> 
> On Wed, Sep 21, 2016 at 11:41 PM, Jörn Franke  wrote:
>> All off-heap memory is still managed by the JVM process. If you limit the
>> memory of this process then you limit the memory. I think the memory of the
>> JVM process could be limited via the xms/xmx parameter of the JVM. This can
>> be configured via spark options for yarn (be aware that they are different
>> in cluster and client mode), but i recommend to use the spark options for
>> the off heap maximum.
>> 
>> https://spark.apache.org/docs/latest/running-on-yarn.html
>> 
>> 
>> On 21 Sep 2016, at 22:02, Michael Segel  wrote:
>> 
>> I’ve asked this question a couple of times from a friend who didn’t know
>> the answer… so I thought I would try here.
>> 
>> 
>> Suppose we launch a job on a cluster (YARN) and we have set up the
>> containers to be 3GB in size.
>> 
>> 
>> What does that 3GB represent?
>> 
>> I mean what happens if we end up using 2-3GB of off heap storage via
>> tungsten?
>> What will Spark do?
>> Will it try to honor the container’s limits and throw an exception or will
>> it allow my job to grab that amount of memory and exceed YARN’s
>> expectations since its off heap?
>> 
>> Thx
>> 
>> -Mike
>> 



Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Sean Owen
It's looking at the whole process's memory usage, and doesn't care
whether the memory is used by the heap or not within the JVM. Of
course, allocating memory off-heap still counts against you at the OS
level.

On Thu, Sep 22, 2016 at 3:54 PM, Michael Segel
 wrote:
> Thanks for the response Sean.
>
> But how does YARN know about the off-heap memory usage?
> That’s the piece that I’m missing.
>
> Thx again,
>
> -Mike
>
>> On Sep 21, 2016, at 10:09 PM, Sean Owen  wrote:
>>
>> No, Xmx only controls the maximum size of on-heap allocated memory.
>> The JVM doesn't manage/limit off-heap (how could it? it doesn't know
>> when it can be released).
>>
>> The answer is that YARN will kill the process because it's using more
>> memory than it asked for. A JVM is always going to use a little
>> off-heap memory by itself, so setting a max heap size of 2GB means the
>> JVM process may use a bit more than 2GB of memory. With an off-heap
>> intensive app like Spark it can be a lot more.
>>
>> There's a built-in 10% overhead, so that if you ask for a 3GB executor
>> it will ask for 3.3GB from YARN. You can increase the overhead.
>>
>> On Wed, Sep 21, 2016 at 11:41 PM, Jörn Franke  wrote:
>>> All off-heap memory is still managed by the JVM process. If you limit the
>>> memory of this process then you limit the memory. I think the memory of the
>>> JVM process could be limited via the xms/xmx parameter of the JVM. This can
>>> be configured via spark options for yarn (be aware that they are different
>>> in cluster and client mode), but i recommend to use the spark options for
>>> the off heap maximum.
>>>
>>> https://spark.apache.org/docs/latest/running-on-yarn.html
>>>
>>>
>>> On 21 Sep 2016, at 22:02, Michael Segel  wrote:
>>>
>>> I’ve asked this question a couple of times from a friend who didn’t know
>>> the answer… so I thought I would try here.
>>>
>>>
>>> Suppose we launch a job on a cluster (YARN) and we have set up the
>>> containers to be 3GB in size.
>>>
>>>
>>> What does that 3GB represent?
>>>
>>> I mean what happens if we end up using 2-3GB of off heap storage via
>>> tungsten?
>>> What will Spark do?
>>> Will it try to honor the container’s limits and throw an exception or will
>>> it allow my job to grab that amount of memory and exceed YARN’s
>>> expectations since its off heap?
>>>
>>> Thx
>>>
>>> -Mike
>>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Michael Segel
I would disagree. 

While you can tune the system to not oversubscribe, I would rather have it hit
swap than fail, especially on long-running jobs.

If we look at oversubscription on Hadoop clusters which are not running HBase…
they survive. It's when you have things like HBase that don't handle swap well…
or you don't allocate enough swap, that things go boom.

Also consider that you could move swap to something that is faster than 
spinning rust. 


> On Sep 22, 2016, at 12:44 PM, Sean Owen  wrote:
> 
> I don't think I'd enable swap on a cluster. You'd rather processes
> fail than grind everything to a halt. You'd buy more memory or
> optimize memory before trading it for I/O.
> 
> On Thu, Sep 22, 2016 at 6:29 PM, Michael Segel
>  wrote:
>> Ok… gotcha… wasn’t sure that YARN just looked at the heap size allocation 
>> and ignored the off heap.
>> 
>> WRT over all OS memory… this would be one reason why I’d keep a decent 
>> amount of swap around. (Maybe even putting it on a fast device like an .m2 
>> or PCIe flash drive….



Re: Error while Spark 1.6.1 streaming from Kafka-2.11_0.10.0.1 cluster

2016-09-22 Thread Cody Koeninger
Do you have the ability to try using Spark 2.0 with the
streaming-kafka-0-10 connector?

I'd expect the 1.6.1 version to be compatible with kafka 0.10, but it
would be good to rule that out.

On Thu, Sep 22, 2016 at 1:37 PM, sagarcasual .  wrote:
> Hello,
>
> I am trying to stream data out of kafka cluster (2.11_0.10.0.1) using Spark
> 1.6.1
> I am receiving following error, and I confirmed that Topic to which I am
> trying to connect exists with the data .
>
> Any idea what could be the case?
>
> kafka.common.UnknownTopicOrPartitionException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at java.lang.Class.newInstance(Class.java:442)
> at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:102)
> at
> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.handleFetchErr(KafkaRDD.scala:184)
> at
> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:193)
> at
> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:208)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at
> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:29)
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Equivalent to --files for driver?

2016-09-22 Thread Jacek Laskowski
Hi Everett,

I'd bet on --driver-class-path (but didn't check that out myself).

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Wed, Sep 21, 2016 at 10:17 PM, Everett Anderson
 wrote:
> Hi,
>
> I'm running Spark 1.6.2 on YARN and I often use the cluster deploy mode with
> spark-submit. While the --files param is useful for getting files onto the
> cluster in the working directories of the executors, the driver's working
> directory doesn't get them.
>
> Is there some equivalent to --files for the driver program when run in this
> mode?
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



sqoop Imported and Hbase ImportTsv issue with Fled: No enum constant mapreduce.JobCounter.MB_MILLIS_MAPS

2016-09-22 Thread Mich Talebzadeh
Hi ,

I have been seeing errors at the OS level when running Sqoop import or the HBase
ImportTsv to get data into Hive and HBase respectively.

The gist of the error is at the last line.

2016-09-22 10:49:39,472 [myid:] - INFO  [main:Job@1356] - Job
job_1474535924802_0003 completed successfully
2016-09-22 10:49:39,611 [myid:] - ERROR [main:ImportTool@607] - Imported
Failed: No enum constant org.apache.hadoop.mapreduce.
JobCounter.MB_MILLIS_MAPS


In short the first part of sqoop job  (importing RDBMS table data into
hdfs) finishes OK, then that error comes up and sqoop stops short of
creating and putting data in Hive table.

With Hbase all finishes OK but you end up with the similar error

It sounds like both Sqoop and HBase generate Java code that is run against
MapReduce.

Now sqoop compiles java with the available JDK (in my case java version
"1.8.0_77") and the produced jar file is run against Hadoop MapReduce.

Writing jar file:
/tmp/sqoop-hduser/compile/b3ed391a517259ba2d2434e6f6ee3542/QueryResult.jar

So it suggests that there is an incompatibility between the Java version
used to compile Sqoop and the one used for Hadoop? As far as I know
they are the same.

I was wondering how can I investigate further?

I have attached the jar file

thanks




Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


QueryResult.java
Description: Binary data

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Josh Rosen
Spark SQL / Tungsten's explicitly-managed off-heap memory will be capped at
spark.memory.offHeap.size bytes. This is purposely specified as an absolute
size rather than as a percentage of the heap size in order to allow end
users to tune Spark so that its overall memory consumption stays within
container memory limits.

To use your example of a 3GB YARN container, you could configure Spark so
that its maximum heap size plus spark.memory.offHeap.size is smaller than
3GB (minus some overhead fudge-factor).
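
As a concrete sketch for that 3GB example (the numbers, class and jar names are only illustrative, not a recommendation):

spark-submit \
  --master yarn \
  --executor-memory 2g \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=512m \
  --class com.example.MyJob myjob.jar

Here YARN is asked for 2g + 1024m = 3GB per executor, and the 2g heap plus 512m of explicit off-heap leaves some headroom for the JVM's own off-heap usage.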

On Thu, Sep 22, 2016 at 7:56 AM Sean Owen  wrote:

> It's looking at the whole process's memory usage, and doesn't care
> whether the memory is used by the heap or not within the JVM. Of
> course, allocating memory off-heap still counts against you at the OS
> level.
>
> On Thu, Sep 22, 2016 at 3:54 PM, Michael Segel
>  wrote:
> > Thanks for the response Sean.
> >
> > But how does YARN know about the off-heap memory usage?
> > That’s the piece that I’m missing.
> >
> > Thx again,
> >
> > -Mike
> >
> >> On Sep 21, 2016, at 10:09 PM, Sean Owen  wrote:
> >>
> >> No, Xmx only controls the maximum size of on-heap allocated memory.
> >> The JVM doesn't manage/limit off-heap (how could it? it doesn't know
> >> when it can be released).
> >>
> >> The answer is that YARN will kill the process because it's using more
> >> memory than it asked for. A JVM is always going to use a little
> >> off-heap memory by itself, so setting a max heap size of 2GB means the
> >> JVM process may use a bit more than 2GB of memory. With an off-heap
> >> intensive app like Spark it can be a lot more.
> >>
> >> There's a built-in 10% overhead, so that if you ask for a 3GB executor
> >> it will ask for 3.3GB from YARN. You can increase the overhead.
> >>
> >> On Wed, Sep 21, 2016 at 11:41 PM, Jörn Franke 
> wrote:
> >>> All off-heap memory is still managed by the JVM process. If you limit
> the
> >>> memory of this process then you limit the memory. I think the memory
> of the
> >>> JVM process could be limited via the xms/xmx parameter of the JVM.
> This can
> >>> be configured via spark options for yarn (be aware that they are
> different
> >>> in cluster and client mode), but i recommend to use the spark options
> for
> >>> the off heap maximum.
> >>>
> >>> https://spark.apache.org/docs/latest/running-on-yarn.html
> >>>
> >>>
> >>> On 21 Sep 2016, at 22:02, Michael Segel 
> wrote:
> >>>
> >>> I’ve asked this question a couple of times from a friend who
> didn’t know
> >>> the answer… so I thought I would try here.
> >>>
> >>>
> >>> Suppose we launch a job on a cluster (YARN) and we have set up the
> >>> containers to be 3GB in size.
> >>>
> >>>
> >>> What does that 3GB represent?
> >>>
> >>> I mean what happens if we end up using 2-3GB of off heap storage via
> >>> tungsten?
> >>> What will Spark do?
> >>> Will it try to honor the container’s limits and throw an exception
> or will
> >>> it allow my job to grab that amount of memory and exceed YARN’s
> >>> expectations since its off heap?
> >>>
> >>> Thx
> >>>
> >>> -Mike
> >>>
> >>>
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Is executor computing time affected by network latency?

2016-09-22 Thread gusiri
Hi,

When I increase the network latency among the Spark nodes, I see that the
compute time (= executor computing time in the Spark Web UI) also increases.

In the graph attached, left = latency 1ms vs right = latency 500ms.

Is there any communication between the workers and the driver/master even
'during' executor computation, or does anyone have an idea what explains this
result?



 





Thank you very much in advance. 

//gusiri




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-executor-computing-time-affected-by-network-latency-tp27779.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark RDD and Memory

2016-09-22 Thread Hanumath Rao Maduri
Hello Aditya,

After the intermediate action has been applied, you might want to call
rdd.unpersist() to let Spark know that the RDD is no longer required.
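
A minimal, untested sketch of that pattern (it assumes, purely for the sake of
the example, that the input files hold tab-separated key/value lines so they
can be turned into pair RDDs before the join):

// Hypothetical key/value parsing -- adjust to the real file format.
val empPairs = sc.textFile("/user/emp.txt")
  .map { line => val a = line.split("\t", 2); (a(0), a(1)) }
val emp1Pairs = sc.textFile("/user/emp1.txt")
  .map { line => val a = line.split("\t", 2); (a(0), a(1)) }

val joined = empPairs.join(emp1Pairs).cache() // keep only the join result around
joined.count()                                // the first action materialises it
// ... further transformations/actions that use joined only ...
joined.unpersist()                            // free the cached blocks when done

As far as I understand, the un-persisted inputs are only materialised while
their partitions are being read, so there is nothing extra to free for them
unless you cached them explicitly.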

Thanks,
-Hanu

On Thu, Sep 22, 2016 at 7:54 AM, Aditya 
wrote:

> Hi,
>
> Suppose I have two RDDs
> val textFile = sc.textFile("/user/emp.txt")
> val textFile1 = sc.textFile("/user/emp1.txt")
>
> Later I perform a join operation on above two RDDs
> val join = textFile.join(textFile1)
>
> And there are subsequent transformations without including textFile and
> textFile1 further and an action to start the execution.
>
> When the action is called, textFile and textFile1 will be loaded into memory
> first. Later the join will be performed and kept in memory.
> My question is: once the join result is in memory and is used for subsequent
> execution, what happens to the textFile and textFile1 RDDs? Are they still kept
> in memory until the full lineage graph is completed, or are they destroyed
> once their use is over? If they are kept in memory, is there any way I can
> explicitly remove them from memory to free it?
>
>
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Michael Segel
OK… gotcha… I wasn't sure whether YARN just looked at the heap size allocation
and ignored the off-heap usage.

WRT overall OS memory… this would be one reason why I'd keep a decent amount of
swap around (maybe even putting it on a fast device like an M.2 or PCIe flash
drive).


> On Sep 22, 2016, at 9:56 AM, Sean Owen  wrote:
> 
> It's looking at the whole process's memory usage, and doesn't care
> whether the memory is used by the heap or not within the JVM. Of
> course, allocating memory off-heap still counts against you at the OS
> level.
> 
> On Thu, Sep 22, 2016 at 3:54 PM, Michael Segel
>  wrote:
>> Thanks for the response Sean.
>> 
>> But how does YARN know about the off-heap memory usage?
>> That’s the piece that I’m missing.
>> 
>> Thx again,
>> 
>> -Mike
>> 
>>> On Sep 21, 2016, at 10:09 PM, Sean Owen  wrote:
>>> 
>>> No, Xmx only controls the maximum size of on-heap allocated memory.
>>> The JVM doesn't manage/limit off-heap (how could it? it doesn't know
>>> when it can be released).
>>> 
>>> The answer is that YARN will kill the process because it's using more
>>> memory than it asked for. A JVM is always going to use a little
>>> off-heap memory by itself, so setting a max heap size of 2GB means the
>>> JVM process may use a bit more than 2GB of memory. With an off-heap
>>> intensive app like Spark it can be a lot more.
>>> 
>>> There's a built-in 10% overhead, so that if you ask for a 3GB executor
>>> it will ask for 3.3GB from YARN. You can increase the overhead.
>>> 
>>> On Wed, Sep 21, 2016 at 11:41 PM, Jörn Franke  wrote:
 All off-heap memory is still managed by the JVM process. If you limit the
 memory of this process then you limit the memory. I think the memory of the
 JVM process could be limited via the xms/xmx parameter of the JVM. This can
 be configured via spark options for yarn (be aware that they are different
 in cluster and client mode), but i recommend to use the spark options for
 the off heap maximum.
 
 https://spark.apache.org/docs/latest/running-on-yarn.html
 
 
 On 21 Sep 2016, at 22:02, Michael Segel  wrote:
 
 I’ve asked this question a couple of times from a friend who didn’t 
 know
 the answer… so I thought I would try here.
 
 
 Suppose we launch a job on a cluster (YARN) and we have set up the
 containers to be 3GB in size.
 
 
 What does that 3GB represent?
 
 I mean what happens if we end up using 2-3GB of off heap storage via
 tungsten?
 What will Spark do?
 Will it try to honor the container’s limits and throw an exception or 
 will
 it allow my job to grab that amount of memory and exceed YARN’s
 expectations since its off heap?
 
 Thx
 
 -Mike
 
>> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org