pyspark pickle error when using itertools.groupby

2016-08-04 Thread 林家銘
Hi
I wrote a map function to aggregate data in a partition, and this function
using  itertools.groupby for more than twice, then there comes the pickle
error .

Here is what I do

===Driver Code===
pair_count = df.mapPartitions(lambda iterable: pair_func_cnt(iterable))
pair_count.collect()

===Map Function ===
def pair_func_cnt(iterable):
    from itertools import groupby

    ls = [[1,2,3],[1,2,5],[1,3,5],[2,4,6]]
    grp1 = [(k, g) for k, g in groupby(ls, lambda e: e[0])]
    grp2 = [(k, g) for k, g in groupby(grp1, lambda e: e[1])]
    return iter(grp2)

===Error Message===

Caused by: org.apache.spark.api.python.PythonException: Traceback
(most recent call last):
  File 
"/opt/zeppelin-0.6.0-bin-netinst/interpreter/spark/pyspark/pyspark.zip/pyspark/worker.py",
line 111, in main
process()
  File 
"/opt/zeppelin-0.6.0-bin-netinst/interpreter/spark/pyspark/pyspark.zip/pyspark/worker.py",
line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/opt/zeppelin-0.6.0-bin-netinst/interpreter/spark/pyspark/pyspark.zip/pyspark/serializers.py",
line 267, in dump_stream
bytes = self.serializer.dumps(vs)
  File 
"/opt/zeppelin-0.6.0-bin-netinst/interpreter/spark/pyspark/pyspark.zip/pyspark/serializers.py",
line 415, in dumps
return pickle.dumps(obj, protocol)
PicklingError: Can't pickle <type 'itertools._grouper'>: attribute lookup
itertools._grouper failed
at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
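
The failure happens because itertools.groupby yields lazy grouper objects, and
those cannot be pickled when Spark serializes the partition output for collect().
A minimal sketch of a workaround (same toy data as above) is to materialize each
group into a list before returning it:

def pair_func_cnt(iterable):
    from itertools import groupby

    ls = [[1, 2, 3], [1, 2, 5], [1, 3, 5], [2, 4, 6]]
    # list(g) materializes each lazy grouper so the result is picklable
    grp1 = [(k, list(g)) for k, g in groupby(ls, lambda e: e[0])]
    grp2 = [(k, list(g)) for k, g in groupby(grp1, lambda e: e[1])]
    return iter(grp2)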


Re: Writing all values for same key to one file

2016-08-04 Thread rtijoriwala
Hi Colzer,
Thanks for the response. My main question was about writing one file per
"key", i.e. having a file with all values for a given key. So in the pseudo
code that I have above, am I opening/creating the file in the right place?
Once the file is created and closed, I cannot append to it.

Thanks,
Ritesh



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Writing-all-values-for-same-key-to-one-file-tp27455p27485.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Spark 2.0] Problem with Spark Thrift Server show NULL instead of showing BIGINT value

2016-08-04 Thread Chanh Le
I checked with Spark 1.6.1 and it still works fine.
I also checked out the latest source code on the Spark 2.0 branch, built it, and got the
same issue.

I think it is because of the API change to Dataset in Spark 2.0?



Regards,
Chanh


> On Aug 5, 2016, at 9:44 AM, Chanh Le  wrote:
> 
> Hi Nicholas,
> Thanks for the information. 
> How did you solve the issue? 
> Did you change the parquet file by renaming the column name? 
> I tried changing the column name when creating the table in Hive without
> changing the parquet file, but it is still showing NULL.
> My parquet files are quite big, so anything I can do without rewriting
> the parquet would be better.
> 
> 
> Regards,
> Chanh.
> 
> 
>> On Aug 5, 2016, at 2:24 AM, Nicholas Hakobian wrote:
>> 
>> Its due to the casing of the 'I' in userId. Your schema (from printSchema) 
>> names the field "userId", while your external table definition has it as 
>> "userid".
>> 
>> We've run into similar issues with external Parquet tables defined in Hive 
>> defined with lowercase only and accessing through HiveContext. You should 
>> check out this documentation as it describes how Spark handles column 
>> definitions:
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-metastore-parquet-table-conversion
>>  
>> 
>> 
>> 
>> Nicholas Szandor Hakobian, Ph.D.
>> Data Scientist
>> Rally Health
>> nicholas.hakob...@rallyhealth.com 
>> 
>> 
>> On Thu, Aug 4, 2016 at 4:53 AM, Chanh Le wrote:
>> Hi Takeshi, 
>> I have already changed the column type to INT and String, but I still got the
>> same NULL values.
>> It only happens with userid, which is why it is so annoying.
>> 
>> thanks and regards, 
>> Chanh
>> 
>> 
>> On Aug 4, 2016 5:59 PM, "Takeshi Yamamuro" wrote:
>> Hi,
>> 
>> When changing the long type into int one, does the issue also happen?
>> And also, could you show more simple query to reproduce the issue?
>> 
>> // maropu
>> 
>> On Thu, Aug 4, 2016 at 7:35 PM, Chanh Le wrote:
>> 
>> Hi everyone,
>> 
>> I have a parquet file and it has data but when I use Spark Thrift Server to 
>> query it shows NULL for userid.
>> As you can see I can get data by Spark Scala but STS is not.
>> 
>> 
>> 
>> The file schema
>> root
>>  |-- time: string (nullable = true)
>>  |-- topic_id: integer (nullable = true)
>>  |-- interest_id: integer (nullable = true)
>>  |-- inmarket_id: integer (nullable = true)
>>  |-- os_id: integer (nullable = true)
>>  |-- browser_id: integer (nullable = true)
>>  |-- device_type: integer (nullable = true)
>>  |-- device_id: integer (nullable = true)
>>  |-- location_id: integer (nullable = true)
>>  |-- age_id: integer (nullable = true)
>>  |-- gender_id: integer (nullable = true)
>>  |-- website_id: integer (nullable = true)
>>  |-- channel_id: integer (nullable = true)
>>  |-- section_id: integer (nullable = true)
>>  |-- zone_id: integer (nullable = true)
>>  |-- placement_id: integer (nullable = true)
>>  |-- advertiser_id: integer (nullable = true)
>>  |-- campaign_id: integer (nullable = true)
>>  |-- payment_id: integer (nullable = true)
>>  |-- creative_id: integer (nullable = true)
>>  |-- audience_id: integer (nullable = true)
>>  |-- merchant_cate: integer (nullable = true)
>>  |-- ad_default: integer (nullable = true)
>>  |-- userId: long (nullable = true)
>>  |-- impression: integer (nullable = true)
>>  |-- viewable: integer (nullable = true)
>>  |-- click: integer (nullable = true)
>>  |-- click_fraud: integer (nullable = true)
>>  |-- revenue: double (nullable = true)
>>  |-- proceeds: double (nullable = true)
>>  |-- spent: double (nullable = true)
>>  |-- network_id: integer (nullable = true)
>> 
>> 
>> I create a table in Spark Thrift Server by.
>> 
>> CREATE EXTERNAL TABLE ad_cookie_report (time String, advertiser_id int, 
>> campaign_id int, payment_id int, creative_id int, website_id int, channel_id 
>> int, section_id int, zone_id int, ad_default int, placment_id int, topic_id 
>> int, interest_id int, inmarket_id int, audience_id int, os_id int, 
>> browser_id int, device_type int, device_id int, location_id int, age_id int, 
>> gender_id int, merchant_cate int, userid bigint, impression int, viewable 
>> int, click int, click_fraud int, revenue double, proceeds double, spent 
>> double, network_id integer)
>> STORED AS PARQUET LOCATION 'alluxio://master2:19998/AD_COOKIE_REPORT';
>> 
>> But when I query it, I get all NULL values.
>> 
>> 0: jdbc:hive2://master1:1> select userid from ad_cookie_report limit 10;
>> +-+--+
>> | userid  |
>> +-+--+
>> | NULL|
>> | NULL|
>> | NULL|
>> | NULL  

Java and SparkSession

2016-08-04 Thread Andy Grove
From some brief experiments using Java with Spark 2.0 it looks like Java
developers should stick to SparkContext and SQLContext rather than using
the new SparkSession API?

It would be great if someone could confirm if that is the intention or not.

Thanks,

Andy.

--

Andy Grove
Chief Architect
www.agildata.com


Re: Regression in Java RDD sortBy() in Spark 2.0

2016-08-04 Thread Andy Grove
Moments after sending this I tracked down the issue to a subsequent
transformation of .top(10) which ran without error in Spark 1.6 (but who
knows how it was sorting since the POJO doesn't implement Comparable)
whereas in Spark 2.0 it now fails if the POJO is not Comparable.

The new behavior is better for sure.

Thanks,

Andy.

--

Andy Grove
Chief Architect
AgilData - Simple Streaming SQL that Scales
www.agildata.com


On Thu, Aug 4, 2016 at 10:25 PM, Andy Grove  wrote:

> Hi,
>
> I have some working Java code with Spark 1.6 that I am upgrading to Spark
> 2.0
>
> I have this valid RDD:
>
> JavaRDD<JPopulationSummary> popSummary
>
> I want to sort using a function I provide for performing comparisons:
>
> popSummary
> .sortBy((Function<JPopulationSummary, Float>) p ->
> p.getMale() * 1.0f / p.getFemale(), true, 1)
>
> The code fails at runtime with the following error.
>
> Caused by: java.lang.ClassCastException: JPopulationSummary cannot be cast
> to java.lang.Comparable
> at org.spark_project.guava.collect.NaturalOrdering.
> compare(NaturalOrdering.java:28)
> at scala.math.LowPriorityOrderingImplicits$$anon$7.compare(Ordering.scala:
> 153)
> at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
> at org.apache.spark.util.collection.Utils$$anon$1.compare(Utils.scala:35)
> at org.spark_project.guava.collect.Ordering.max(Ordering.java:551)
> at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:667)
> at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
> at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$
> anonfun$30.apply(RDD.scala:1374)
> at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$
> anonfun$30.apply(RDD.scala:1371)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$
> anonfun$apply$23.apply(RDD.scala:766)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$
> anonfun$apply$23.apply(RDD.scala:766)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
>
> Even if the POJO did implement Comparable, Spark shouldn't care since I
> provided the comparator I want to sort by.
>
> Am I doing something wrong or is this a regression?
>
> Thanks,
>
> Andy.
>
> --
>
> Andy Grove
> Chief Architect
> www.agildata.com
>
>


Regression in Java RDD sortBy() in Spark 2.0

2016-08-04 Thread Andy Grove
Hi,

I have some working Java code with Spark 1.6 that I am upgrading to Spark
2.0

I have this valid RDD:

JavaRDD<JPopulationSummary> popSummary

I want to sort using a function I provide for performing comparisons:

popSummary
.sortBy((Function<JPopulationSummary, Float>) p -> p.getMale()
* 1.0f / p.getFemale(), true, 1)

The code fails at runtime with the following error.

Caused by: java.lang.ClassCastException: JPopulationSummary cannot be cast
to java.lang.Comparable
at
org.spark_project.guava.collect.NaturalOrdering.compare(NaturalOrdering.java:28)
at
scala.math.LowPriorityOrderingImplicits$$anon$7.compare(Ordering.scala:153)
at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
at org.apache.spark.util.collection.Utils$$anon$1.compare(Utils.scala:35)
at org.spark_project.guava.collect.Ordering.max(Ordering.java:551)
at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:667)
at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
at
org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1374)
at
org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1371)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


Even if the POJO did implement Comparable, Spark shouldn't care since I
provided the comparator I want to sort by.

Am I doing something wrong or is this a regression?

Thanks,

Andy.

--

Andy Grove
Chief Architect
www.agildata.com


Re: [Spark1.6]:compare rows and add new column based on lookup

2016-08-04 Thread ayan guha
select * from (
select col1 as old_st, col2 as person, lead(col1) over (partition by col2
order by timestamp) as next_st from main_table
) m where next_st is not null

This will give you old street to new street in one row. You can then join
to lookup table.
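
An equivalent sketch with the DataFrame API (assuming the columns are named
col1, col2 and timestamp as in the example; the names are illustrative):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("col2").orderBy("timestamp")
moves = (df.withColumn("next_st", F.lead("col1").over(w))
           .where(F.col("next_st").isNotNull()))
# then join `moves` to the lookup table on (col1, next_st) to get the area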

On Fri, Aug 5, 2016 at 12:48 PM, Divya Gehlot 
wrote:

>  based on the time stamp column
>
> On 5 August 2016 at 10:43, ayan guha  wrote:
>
>> How do you know person1 is moving from street1 to street2 and not other
>> way around? Basically, how do you ensure the order of the rows as you have
>> written them?
>>
>> On Fri, Aug 5, 2016 at 12:16 PM, Divya Gehlot 
>> wrote:
>>
>>> Hi,
>>> I am working with Spark 1.6 with scala  and using Dataframe API .
>>> I have a use case where I  need to compare two rows and add entry in the
>>> new column based on the lookup table
>>> for example :
>>> My DF looks like :
>>> col1     col2     newCol1
>>> street1  person1
>>> street2  person1  area1
>>> street3  person1  area3
>>> street5  person2
>>> street6  person2  area5
>>> street7  person4
>>> street9  person4  area7
>>>
>>> lookup table looks like
>>> street1 -> street2 - area1
>>> street2 -> street3 - area3
>>> street5 -> street6 - area5
>>> street7 -> street9 - area7
>>>
>>> if a person moves from street1 to street2 then he is reaching area1
>>>
>>>
>>> Would really appreciate the help.
>>>
>>> Thanks,
>>> Divya
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


-- 
Best Regards,
Ayan Guha


Re: [Spark1.6]:compare rows and add new column based on lookup

2016-08-04 Thread Divya Gehlot
 based on the time stamp column

On 5 August 2016 at 10:43, ayan guha  wrote:

> How do you know person1 is moving from street1 to street2 and not other
> way around? Basically, how do you ensure the order of the rows as you have
> written them?
>
> On Fri, Aug 5, 2016 at 12:16 PM, Divya Gehlot 
> wrote:
>
>> Hi,
>> I am working with Spark 1.6 with scala  and using Dataframe API .
>> I have a use case where I  need to compare two rows and add entry in the
>> new column based on the lookup table
>> for example :
>> My DF looks like :
>> col1     col2     newCol1
>> street1  person1
>> street2  person1  area1
>> street3  person1  area3
>> street5  person2
>> street6  person2  area5
>> street7  person4
>> street9  person4  area7
>>
>> lookup table looks like
>> street1 -> street2 - area1
>> street2 -> street3 - area3
>> street5 -> street6 - area5
>> street7 -> street9 - area7
>>
>> if a person moves from street1 to street2 then he is reaching area1
>>
>>
>> Would really appreciate the help.
>>
>> Thanks,
>> Divya
>>
>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: [Spark 2.0] Problem with Spark Thrift Server show NULL instead of showing BIGINT value

2016-08-04 Thread Chanh Le
Hi Nicholas,
Thanks for the information. 
How did you solve the issue? 
Did you change the parquet file by renaming the column name? 
I tried changing the column name when creating the table in Hive without changing
the parquet file, but it is still showing NULL.
My parquet files are quite big, so anything I can do without rewriting the
parquet would be better.


Regards,
Chanh.


> On Aug 5, 2016, at 2:24 AM, Nicholas Hakobian 
>  wrote:
> 
> Its due to the casing of the 'I' in userId. Your schema (from printSchema) 
> names the field "userId", while your external table definition has it as 
> "userid".
> 
> We've run into similar issues with external Parquet tables defined in Hive 
> defined with lowercase only and accessing through HiveContext. You should 
> check out this documentation as it describes how Spark handles column 
> definitions:
> http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-metastore-parquet-table-conversion
>  
> 
> 
> 
> Nicholas Szandor Hakobian, Ph.D.
> Data Scientist
> Rally Health
> nicholas.hakob...@rallyhealth.com 
> 
> 
> On Thu, Aug 4, 2016 at 4:53 AM, Chanh Le wrote:
> Hi Takeshi, 
> I have already changed the column type to INT and String, but I still got the same
> NULL values.
> It only happens with userid, which is why it is so annoying.
> 
> thanks and regards, 
> Chanh
> 
> 
> On Aug 4, 2016 5:59 PM, "Takeshi Yamamuro" wrote:
> Hi,
> 
> When changing the long type into int one, does the issue also happen?
> And also, could you show more simple query to reproduce the issue?
> 
> // maropu
> 
> On Thu, Aug 4, 2016 at 7:35 PM, Chanh Le wrote:
> 
> Hi everyone,
> 
> I have a parquet file and it has data but when I use Spark Thrift Server to 
> query it shows NULL for userid.
> As you can see I can get data by Spark Scala but STS is not.
> 
> 
> 
> The file schema
> root
>  |-- time: string (nullable = true)
>  |-- topic_id: integer (nullable = true)
>  |-- interest_id: integer (nullable = true)
>  |-- inmarket_id: integer (nullable = true)
>  |-- os_id: integer (nullable = true)
>  |-- browser_id: integer (nullable = true)
>  |-- device_type: integer (nullable = true)
>  |-- device_id: integer (nullable = true)
>  |-- location_id: integer (nullable = true)
>  |-- age_id: integer (nullable = true)
>  |-- gender_id: integer (nullable = true)
>  |-- website_id: integer (nullable = true)
>  |-- channel_id: integer (nullable = true)
>  |-- section_id: integer (nullable = true)
>  |-- zone_id: integer (nullable = true)
>  |-- placement_id: integer (nullable = true)
>  |-- advertiser_id: integer (nullable = true)
>  |-- campaign_id: integer (nullable = true)
>  |-- payment_id: integer (nullable = true)
>  |-- creative_id: integer (nullable = true)
>  |-- audience_id: integer (nullable = true)
>  |-- merchant_cate: integer (nullable = true)
>  |-- ad_default: integer (nullable = true)
>  |-- userId: long (nullable = true)
>  |-- impression: integer (nullable = true)
>  |-- viewable: integer (nullable = true)
>  |-- click: integer (nullable = true)
>  |-- click_fraud: integer (nullable = true)
>  |-- revenue: double (nullable = true)
>  |-- proceeds: double (nullable = true)
>  |-- spent: double (nullable = true)
>  |-- network_id: integer (nullable = true)
> 
> 
> I create a table in Spark Thrift Server by.
> 
> CREATE EXTERNAL TABLE ad_cookie_report (time String, advertiser_id int, 
> campaign_id int, payment_id int, creative_id int, website_id int, channel_id 
> int, section_id int, zone_id int, ad_default int, placment_id int, topic_id 
> int, interest_id int, inmarket_id int, audience_id int, os_id int, browser_id 
> int, device_type int, device_id int, location_id int, age_id int, gender_id 
> int, merchant_cate int, userid bigint, impression int, viewable int, click 
> int, click_fraud int, revenue double, proceeds double, spent double, 
> network_id integer)
> STORED AS PARQUET LOCATION 'alluxio://master2:19998/AD_COOKIE_REPORT';
> 
> But when I query it, I get all NULL values.
> 
> 0: jdbc:hive2://master1:1> select userid from ad_cookie_report limit 10;
> +-+--+
> | userid  |
> +-+--+
> | NULL|
> | NULL|
> | NULL|
> | NULL|
> | NULL|
> | NULL|
> | NULL|
> | NULL|
> | NULL|
> | NULL|
> +-+--+
> 10 rows selected (3.507 seconds)
> 
> How can I solve the problem? Is it related to the field being uppercase?
> How can I change the field name in this situation?
> 
> 
> Regards,
> Chanh
> 
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro
> 
> 
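
Since the NULL comes from the case mismatch between the parquet schema ("userId")
and the Hive table definition ("userid"), one option, if rewriting the data is
acceptable, is to normalize the column name with Spark and write the parquet back
out before re-creating the external table. A minimal sketch (the output path is
illustrative):

df = spark.read.parquet("alluxio://master2:19998/AD_COOKIE_REPORT")
(df.withColumnRenamed("userId", "userid")
   .write.mode("overwrite")
   .parquet("alluxio://master2:19998/AD_COOKIE_REPORT_lowercase"))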



Re: [Spark1.6]:compare rows and add new column based on lookup

2016-08-04 Thread ayan guha
How do you know person1 is moving from street1 to street2 and not the other
way around? Basically, how do you ensure the order of the rows as you have
written them?

On Fri, Aug 5, 2016 at 12:16 PM, Divya Gehlot 
wrote:

> Hi,
> I am working with Spark 1.6 with scala  and using Dataframe API .
> I have a use case where I  need to compare two rows and add entry in the
> new column based on the lookup table
> for example :
> My DF looks like :
> col1     col2     newCol1
> street1  person1
> street2  person1  area1
> street3  person1  area3
> street5  person2
> street6  person2  area5
> street7  person4
> street9  person4  area7
>
> lookup table looks like
> street1 -> street2 - area1
> street2 -> street3 - area3
> street5 -> street6 - area5
> street7 -> street9 - area7
>
> if a person moves from street1 to street2 then he is reaching area1
>
>
> Would really appreciate the help.
>
> Thanks,
> Divya
>
>
>
>


-- 
Best Regards,
Ayan Guha


[Spark1.6]:compare rows and add new column based on lookup

2016-08-04 Thread Divya Gehlot
Hi,
I am working with Spark 1.6 with Scala and using the DataFrame API.
I have a use case where I need to compare two rows and add an entry in a
new column based on a lookup table.
For example:
My DF looks like:
col1     col2     newCol1
street1  person1
street2  person1  area1
street3  person1  area3
street5  person2
street6  person2  area5
street7  person4
street9  person4  area7

lookup table looks like
street1 -> street2 - area1
street2 -> street3 - area3
street5 -> street6 - area5
street7 -> street9 - area7

if a person moves from street1 to street2 then he is reaching area1


Would really appreciate the help.

Thanks,
Divya


Re: Writing all values for same key to one file

2016-08-04 Thread colzer
For an RDD, you can use `saveAsHadoopFile` with a custom `MultipleOutputFormat`.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Writing-all-values-for-same-key-to-one-file-tp27455p27483.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark 1.6 Streaming delay after long run

2016-08-04 Thread Chan Chor Pang

After upgrading from Spark 1.5 to 1.6 (CDH 5.6.0 -> 5.7.1),
some of our streaming jobs are getting delayed after a long run.

With a little investigation, here is what I found:
- the same program has no problem with Spark 1.5
- we have two kinds of streaming jobs and only those with
"updateStateByKey" are affected
- CPU usage gets higher and higher over time (around 1core@5% at
start and 1core@100% after a week)
- the data rate is around 100 events/s, so there is no reason for the CPU
to work that hard

- processing time for a batch grows from 100ms at start to 3s after a week
- even running the same program (on different input data), not
all processes are delayed to the same degree
- no warning or error message until the delay becomes too large and the job
runs out of memory

- the processing time of our own code seems fine
- memory/heap usage looks normal to me

I suspect the problem is coming from updateStateByKey but I can't
trace it down.


Has anyone experienced the same problem?


--
BR
Peter Chan

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



singular value decomposition in Spark ML

2016-08-04 Thread Sandy Ryza
Hi,

Is SVD or PCA in Spark ML (i.e. spark.ml parity with the mllib
RowMatrix.computeSVD API) slated for any upcoming release?

Many thanks for any guidance!
-Sandy


Re: Writing all values for same key to one file

2016-08-04 Thread ayan guha
Partition your data using the key

rdd.partitionByKey()

On Fri, Aug 5, 2016 at 10:10 AM, rtijoriwala 
wrote:

> Any recommendations? comments?
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Writing-all-values-for-same-key-to-
> one-file-tp27455p27480.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Best Regards,
Ayan Guha
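
A minimal PySpark sketch of that idea at the DataFrame level (the column name and
output path are illustrative): writing with partitionBy puts all records for a
given key value under the same output directory, which is usually the practical
way to keep all values for a key together without opening files by hand.

df.write.partitionBy("key").mode("overwrite").parquet("/tmp/output_by_key")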


Re: Spark SQL Hive Authorization

2016-08-04 Thread arin.g
Any updates on this? I am also trying to install Ranger with Spark SQL and I
have the same issue with Spark 1.6 and Ranger 0.5.4. I have used the
enable-plugin.sh script to activate the hive-ranger plugin, and verified
that all the required configuration files are in spark/conf.

Thanks,
-Arin

rmenon wrote
> Hello,
> 
> We are trying to configure Spark SQL + Ranger for hive authorization with
> no progress. 
> 
> Spark SQL is communicating with hive metastore with no authorization
> without any issues. However, all authorization setup is being ignored.
> 
> We have currently tried by setting configuring following properties:
> hive.security.authorization.manager=org.apache.ranger.authorization.hive.authorizer.RangerHiveAuthorizerFactory
> hive.security.authorization.enabled=true
> hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
> 
> The following ranger specific jars have been placed in spark classpath :
> ranger-hive-plugin, ranger-plugins-common, ranger-plugins-audit, guava
> 
> Additionally, we have placed the ranger configurations
> ranger-hive-security.xml and ranger-hive-audit.xml files in spark conf.
> 
> Versions used:
> Spark - 1.6.1
> Hive -1.2.1
> Ranger - 0.5.3
> 
> We do not see Spark SQL honoring the hive.security.authorization.manager.
> Are there any suggestions of how we could configure this setup to work? 
> 
> Regards,
> Rohit





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Hive-Authorization-tp27221p27482.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Writing all values for same key to one file

2016-08-04 Thread rtijoriwala
Any recommendations? comments?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Writing-all-values-for-same-key-to-one-file-tp27455p27480.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to set nullable field when create DataFrame using case class

2016-08-04 Thread Luis Mateos
Hi Jacek,

I have not used Encoders before. Definitely this works! Thank you!

Luis


On 4 August 2016 at 18:23, Jacek Laskowski  wrote:

> On Thu, Aug 4, 2016 at 11:56 PM, luismattor  wrote:
>
> > import java.sql.Timestamp
> > case class MyProduct(t: Timestamp, a: Float)
> > val rdd = sc.parallelize(List(MyProduct(new Timestamp(0), 10))).toDF()
> > rdd.printSchema()
> >
> > The output is:
> > root
> >  |-- t: timestamp (nullable = true)
> >  |-- a: float (nullable = false)
> >
> > How can I set the timestamp column to be NOT nullable?
>
> Gotcha! :)
>
> scala> import java.sql.Timestamp
> import java.sql.Timestamp
>
> scala> case class MyProduct(t: java.sql.Timestamp, a: Float)
> defined class MyProduct
>
> scala> import org.apache.spark.sql._
> import org.apache.spark.sql._
>
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
>
> scala> import org.apache.spark.sql.catalyst.encoders._
> import org.apache.spark.sql.catalyst.encoders._
>
> scala> implicit def myEncoder: Encoder[MyProduct] =
> ExpressionEncoder[MyProduct].copy(schema = new StructType().add("t",
> "timestamp", false).add("a", "float", false))
> myEncoder: org.apache.spark.sql.Encoder[MyProduct]
>
> scala> spark.createDataset(Seq(MyProduct(new Timestamp(0),
> 10))).printSchema
> root
>  |-- t: timestamp (nullable = false)
>  |-- a: float (nullable = false)
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>


Re: How to set nullable field when create DataFrame using case class

2016-08-04 Thread Michael Armbrust
Nullable is an optimization for Spark SQL.  It is telling Spark to not even
do an if check when accessing that field.

In this case, your data *is* nullable, because Timestamp is an object in
Java and you could put null there.
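
If you want the schema to report nullable = false anyway, one option (sketched here
in the Python API; the same idea applies in Scala with an explicit schema or a
custom Encoder) is to supply the schema yourself instead of inferring it:

from pyspark.sql.types import StructType, StructField, TimestampType, FloatType

schema = StructType([
    StructField("t", TimestampType(), nullable=False),
    StructField("a", FloatType(), nullable=False),
])
# rows is a hypothetical list/RDD of (timestamp, float) tuples
df = spark.createDataFrame(rows, schema)
df.printSchema()  # both fields now show nullable = false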

On Thu, Aug 4, 2016 at 2:56 PM, luismattor  wrote:

> Hi all,
>
> Consider the following case:
>
> import java.sql.Timestamp
> case class MyProduct(t: Timestamp, a: Float)
> val rdd = sc.parallelize(List(MyProduct(new Timestamp(0), 10))).toDF()
> rdd.printSchema()
>
> The output is:
> root
>  |-- t: timestamp (nullable = true)
>  |-- a: float (nullable = false)
>
> How can I set the timestamp column to be NOT nullable?
>
> Regards,
> Luis
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/How-to-set-nullable-field-when-
> create-DataFrame-using-case-class-tp27479.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Explanation regarding Spark Streaming

2016-08-04 Thread Jacek Laskowski
On Fri, Aug 5, 2016 at 12:48 AM, Mohammed Guller  wrote:
> and eventually you will run out of memory.

Why? Mind elaborating?

Jacek

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to set nullable field when create DataFrame using case class

2016-08-04 Thread Jacek Laskowski
On Thu, Aug 4, 2016 at 11:56 PM, luismattor  wrote:

> import java.sql.Timestamp
> case class MyProduct(t: Timestamp, a: Float)
> val rdd = sc.parallelize(List(MyProduct(new Timestamp(0), 10))).toDF()
> rdd.printSchema()
>
> The output is:
> root
>  |-- t: timestamp (nullable = true)
>  |-- a: float (nullable = false)
>
> How can I set the timestamp column to be NOT nullable?

Gotcha! :)

scala> import java.sql.Timestamp
import java.sql.Timestamp

scala> case class MyProduct(t: java.sql.Timestamp, a: Float)
defined class MyProduct

scala> import org.apache.spark.sql._
import org.apache.spark.sql._

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> import org.apache.spark.sql.catalyst.encoders._
import org.apache.spark.sql.catalyst.encoders._

scala> implicit def myEncoder: Encoder[MyProduct] =
ExpressionEncoder[MyProduct].copy(schema = new StructType().add("t",
"timestamp", false).add("a", "float", false))
myEncoder: org.apache.spark.sql.Encoder[MyProduct]

scala> spark.createDataset(Seq(MyProduct(new Timestamp(0), 10))).printSchema
root
 |-- t: timestamp (nullable = false)
 |-- a: float (nullable = false)

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to set nullable field when create DataFrame using case class

2016-08-04 Thread Jacek Laskowski
On Thu, Aug 4, 2016 at 11:56 PM, luismattor  wrote:

> How can I set the timestamp column to be NOT nullable?

Hi,

Given [1] it's not possible without defining your own Encoder for
Dataset (that you use implicitly).

It'd be something as follows:

implicit def myEncoder: Encoder[MyProduct] = ???
spark.createDataset(Seq(MyProduct(new Timestamp(0), 10)))

I don't know how to create the Encoder though (lack of skills). You'd
need to use Encoders.product[MyProduct] as a guideline.

That might help -
http://stackoverflow.com/questions/36648128/how-to-store-custom-objects-in-a-dataset-in-spark-1-6.

[1] 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L672

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Explanation regarding Spark Streaming

2016-08-04 Thread Mich Talebzadeh
Also check the Spark UI streaming section for various helpful stats. By default
it runs on port 4040, but you can change it by setting --conf "spark.ui.port="

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 4 August 2016 at 23:48, Mohammed Guller  wrote:

> The backlog will increase as time passes and eventually you will run out
> of memory.
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> 
>
>
>
> *From:* Saurav Sinha [mailto:sauravsinh...@gmail.com]
> *Sent:* Wednesday, August 3, 2016 11:57 PM
> *To:* user
> *Subject:* Explanation regarding Spark Streaming
>
>
>
> Hi,
>
>
>
> I have query
>
>
>
> Q1. What will happen if a Spark Streaming job has batchDurationTime as 60
> sec and the processing time of the complete pipeline is greater than 60 sec?
>
>
>
> --
>
> Thanks and Regards,
>
>
>
> Saurav Sinha
>
>
>
> Contact: 9742879062
>


RE: Explanation regarding Spark Streaming

2016-08-04 Thread Mohammed Guller
The backlog will increase as time passes and eventually you will run out of 
memory.

Mohammed
Author: Big Data Analytics with 
Spark

From: Saurav Sinha [mailto:sauravsinh...@gmail.com]
Sent: Wednesday, August 3, 2016 11:57 PM
To: user
Subject: Explanation regarding Spark Streaming

Hi,

I have query

Q1. What will happen if a Spark Streaming job has batchDurationTime as 60 sec
and the processing time of the complete pipeline is greater than 60 sec?

--
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Re: Questions about ml.random forest (only one decision tree?)

2016-08-04 Thread Robin East
All supervised learning algorithms in Spark work the same way. You provide a 
set of ‘features’ (X) and a corresponding label (y) as part of a pipeline and 
call the fit method on the pipeline. The output of this is a model. You can 
then provide new examples (new Xs) to a transform method on the model that will 
give you a prediction for those examples. This means that the code for running 
different algorithms often looks very similar. The details of the algorithm are 
hidden behind the fit/transform interface.

In the case of Random Forest the implementation in Spark (i.e. behind the 
interface) is to create a number of different decision tree models (often quite 
simple models) and then ensemble the results of each decision tree. You don’t 
need to ‘create’ the decision trees yourself, that is handled by the 
implementation.
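
For example, in the Python API the number of trees is just a parameter of the
estimator. A minimal sketch (the DataFrames and column names are illustrative):

from pyspark.ml.classification import RandomForestClassifier

# train_df / test_df are hypothetical DataFrames with "features" and "label" columns
rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=20)
model = rf.fit(train_df)                # builds 20 decision trees internally
predictions = model.transform(test_df)  # ensembles the trees to make predictions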

Hope that helps

Robin
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action 






> On 4 Aug 2016, at 09:48, 陈哲  wrote:
> 
> Hi all
>  I'm trying to use spark ml to do some prediction with random forest. By 
> reading the example code 
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaRandomForestClassifierExample.java
>  
> 
>  , I can only find out it's similar to 
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java
>  
> .
>  Is the random forest algorithm supposed to use multiple decision trees to work?
>  I'm new to Spark and ML. Can anyone help me, maybe by providing an
> example of using multiple decision trees in a random forest in Spark?
> 
> Thanks
> Best Regards
> Patrick



How to set nullable field when create DataFrame using case class

2016-08-04 Thread luismattor
Hi all,

Consider the following case:

import java.sql.Timestamp
case class MyProduct(t: Timestamp, a: Float)
val rdd = sc.parallelize(List(MyProduct(new Timestamp(0), 10))).toDF()
rdd.printSchema()

The output is:
root
 |-- t: timestamp (nullable = true)
 |-- a: float (nullable = false)

How can I set the timestamp column to be NOT nullable?

Regards,
Luis



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-set-nullable-field-when-create-DataFrame-using-case-class-tp27479.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Add column sum as new column in PySpark dataframe

2016-08-04 Thread Mike Metzger
This is a little ugly, but it may do what you're after -

from pyspark.sql.functions import expr
df.withColumn('total', expr("+".join([col for col in df.columns])))

I believe this will handle null values ok, but will likely error if there
are any string columns present.


Mike



On Thu, Aug 4, 2016 at 8:41 AM, Javier Rey  wrote:

> Hi everybody,
>
> Sorry, I sent the last message incomplete; this is the complete version:
>
> I'm using PySpark and I have a Spark dataframe with a bunch of numeric
> columns. I want to add a column that is the sum of all the other columns.
>
> Suppose my dataframe had columns "a", "b", and "c". I know I can do this:
>
> df.withColumn('total_col', df.a + df.b + df.c)
>
> The problem is that I don't want to type out each column individually and
> add them, especially if I have a lot of columns. I want to be able to do
> this automatically or by specifying a list of column names that I want to
> add. Is there another way to do this?
>
> I find this solution:
>
> df.withColumn('total', sum(df[col] for col in df.columns))
>
> But I get this error:
>
> "AttributeError: 'generator' object has no attribute '_get_object_id"
>
> Additionally, I want to sum only non-null values.
>
> Thanks in advance,
>
> Samir
>
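
One way to both avoid the generator error and treat nulls as zero (a sketch that
assumes all the listed columns are numeric) is to fold the columns with
functools.reduce and coalesce:

from functools import reduce
from pyspark.sql import functions as F

cols = df.columns  # or an explicit list of the numeric columns to add
total = reduce(lambda a, b: a + b,
               [F.coalesce(F.col(c), F.lit(0)) for c in cols])
df = df.withColumn('total', total)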


Symbol HasInputCol is inaccessible from this place

2016-08-04 Thread janardhan shetty
Version : 2.0.0-preview

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param._
import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
import org.apache.spark.ml.util.DefaultParamsWritable

class CustomTransformer(override val uid: String) extends Transformer
  with HasInputCol with HasOutputCol with DefaultParamsWritable

*Error in IntelliJ*
Symbol HasInputCol is inaccessible from this place
similarly for HasOutputCol and DefaultParamsWritable

Any thoughts on this error? It is not allowing the code to compile.


Re: registering udf to use in spark.sql('select...

2016-08-04 Thread Mich Talebzadeh
Yes pretty straight forward define, register and use

def cleanupCurrency (word : String) : Double = {
 word.toString.substring(1).replace(",", "").toDouble
}
sqlContext.udf.register("cleanupCurrency", cleanupCurrency(_:String))


val a = df.filter(col("Total") > "").map(p => Invoices(p(0).toString,
p(1).toString, cleanupCurrency(p(2).toString),
cleanupCurrency(p(3).toString), cleanupCurrency(p(4).toString)))

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 4 August 2016 at 17:09, Nicholas Chammas 
wrote:

> No, SQLContext is not disappearing. The top-level class is replaced by
> SparkSession, but you can always get the underlying context from the
> session.
>
> You can also use SparkSession.udf.register(), which is just a wrapper for
> sqlContext.registerFunction.
>
> On Thu, Aug 4, 2016 at 12:04 PM Ben Teeuwen  wrote:
>
>> Yes, but I don’t want to use it in a select() call.
>> Either selectExpr() or spark.sql(), with the udf being called inside a
>> string.
>>
>> Now I got it to work using "sqlContext.registerFunction('
>> encodeOneHot_udf',encodeOneHot, VectorUDT())”
>> But this sqlContext approach will disappear, right? So I’m curious what
>> to use instead.
>>
>> On Aug 4, 2016, at 3:54 PM, Nicholas Chammas 
>> wrote:
>>
>> Have you looked at pyspark.sql.functions.udf and the associated examples?
>> On Thu, Aug 4, 2016 at 9:10 AM, Ben Teeuwen wrote:
>>
>>> Hi,
>>>
>>> I’d like to use a UDF in pyspark 2.0. As in ..
>>> 
>>>
>>> def squareIt(x):
>>>   return x * x
>>>
>>> # register the function and define return type
>>> ….
>>>
>>> spark.sql(“”"select myUdf(adgroupid, 'extra_string_parameter') as
>>> function_result from df’)
>>>
>>> _
>>>
>>> How can I register the function? I only see registerFunction in the
>>> deprecated sqlContext at http://spark.apache.org/
>>> docs/2.0.0/api/python/pyspark.sql.html.
>>> As the ‘spark’ object unifies hiveContext and sqlContext, what is the
>>> new way to go?
>>>
>>> Ben
>>>
>>
>>


Re: Spark 2.0 - make-distribution fails while regular build succeeded

2016-08-04 Thread Richard Siebeling
Fixed! After adding the option -DskipTests everything built OK.
Thanks Sean for your help.

On Thu, Aug 4, 2016 at 8:18 PM, Richard Siebeling 
wrote:

> I don't see any other errors, these are the last lines of the
> make-distribution log.
> Above these lines there are no errors...
>
>
> [INFO] Building jar: /opt/mapr/spark/spark-2.0.0/
> common/network-yarn/target/spark-network-yarn_2.11-2.0.0-test-sources.jar
> [warn] /opt/mapr/spark/spark-2.0.0/core/src/main/scala/org/
> apache/spark/api/python/PythonRDD.scala:78: class Accumulator in package
> spark is deprecated: use AccumulatorV2
> [warn] accumulator: Accumulator[JList[Array[Byte]]])
> [warn]  ^
> [warn] /opt/mapr/spark/spark-2.0.0/core/src/main/scala/org/
> apache/spark/api/python/PythonRDD.scala:71: class Accumulator in package
> spark is deprecated: use AccumulatorV2
> [warn] private[spark] case class PythonFunction(
> [warn]   ^
> [warn] /opt/mapr/spark/spark-2.0.0/core/src/main/scala/org/
> apache/spark/api/python/PythonRDD.scala:873: trait AccumulatorParam in
> package spark is deprecated: use AccumulatorV2
> [warn]   extends AccumulatorParam[JList[Array[Byte]]] {
> [warn]   ^
> [warn] /opt/mapr/spark/spark-2.0.0/core/src/main/scala/org/
> apache/spark/util/AccumulatorV2.scala:459: trait AccumulableParam in
> package spark is deprecated: use AccumulatorV2
> [warn] param: org.apache.spark.AccumulableParam[R, T]) extends
> AccumulatorV2[T, R] {
> [warn] ^
> [warn] four warnings found
> [error] warning: [options] bootstrap class path not set in conjunction
> with -source 1.7
> [error] Compile failed at Aug 3, 2016 2:13:07 AM [1:12.769s]
> [INFO] 
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Spark Project Parent POM ... SUCCESS [
>  3.850 s]
> [INFO] Spark Project Tags . SUCCESS [
>  6.053 s]
> [INFO] Spark Project Sketch ... SUCCESS [
>  9.977 s]
> [INFO] Spark Project Networking ... SUCCESS [
> 17.696 s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
>  8.864 s]
> [INFO] Spark Project Unsafe ... SUCCESS [
> 17.485 s]
> [INFO] Spark Project Launcher . SUCCESS [
> 19.551 s]
> [INFO] Spark Project Core . FAILURE
> [01:19 min]
> [INFO] Spark Project GraphX ... SKIPPED
> [INFO] Spark Project Streaming  SKIPPED
> [INFO] Spark Project Catalyst . SKIPPED
> [INFO] Spark Project SQL .. SKIPPED
> [INFO] Spark Project ML Local Library . SUCCESS [
> 19.594 s]
> [INFO] Spark Project ML Library ... SKIPPED
> [INFO] Spark Project Tools  SUCCESS [
>  6.972 s]
> [INFO] Spark Project Hive . SKIPPED
> [INFO] Spark Project REPL . SKIPPED
> [INFO] Spark Project YARN Shuffle Service . SUCCESS [
> 12.019 s]
> [INFO] Spark Project YARN . SKIPPED
> [INFO] Spark Project Assembly . SKIPPED
> [INFO] Spark Project External Flume Sink .. SUCCESS [
> 13.460 s]
> [INFO] Spark Project External Flume ... SKIPPED
> [INFO] Spark Project External Flume Assembly .. SKIPPED
> [INFO] Spark Integration for Kafka 0.8  SKIPPED
> [INFO] Spark Project Examples . SKIPPED
> [INFO] Spark Project External Kafka Assembly .. SKIPPED
> [INFO] Spark Integration for Kafka 0.10 ... SKIPPED
> [INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time: 02:08 min (Wall Clock)
> [INFO] Finished at: 2016-08-03T02:13:07+02:00
> [INFO] Final Memory: 54M/844M
> [INFO] 
> 
> [ERROR] Failed to execute goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile
> (scala-compile-first) on project spark-core_2.11: Execution
> scala-compile-first of goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile
> failed. CompileFailed -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the
> -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1] 

Re: Spark 2.0 - make-distribution fails while regular build succeeded

2016-08-04 Thread Richard Siebeling
I don't see any other errors, these are the last lines of the
make-distribution log.
Above these lines there are no errors...


[INFO] Building jar:
/opt/mapr/spark/spark-2.0.0/common/network-yarn/target/spark-network-yarn_2.11-2.0.0-test-sources.jar
[warn]
/opt/mapr/spark/spark-2.0.0/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:78:
class Accumulator in package spark is deprecated: use AccumulatorV2
[warn] accumulator: Accumulator[JList[Array[Byte]]])
[warn]  ^
[warn]
/opt/mapr/spark/spark-2.0.0/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:71:
class Accumulator in package spark is deprecated: use AccumulatorV2
[warn] private[spark] case class PythonFunction(
[warn]   ^
[warn]
/opt/mapr/spark/spark-2.0.0/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:873:
trait AccumulatorParam in package spark is deprecated: use AccumulatorV2
[warn]   extends AccumulatorParam[JList[Array[Byte]]] {
[warn]   ^
[warn]
/opt/mapr/spark/spark-2.0.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala:459:
trait AccumulableParam in package spark is deprecated: use AccumulatorV2
[warn] param: org.apache.spark.AccumulableParam[R, T]) extends
AccumulatorV2[T, R] {
[warn] ^
[warn] four warnings found
[error] warning: [options] bootstrap class path not set in conjunction with
-source 1.7
[error] Compile failed at Aug 3, 2016 2:13:07 AM [1:12.769s]
[INFO]

[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ... SUCCESS [
 3.850 s]
[INFO] Spark Project Tags . SUCCESS [
 6.053 s]
[INFO] Spark Project Sketch ... SUCCESS [
 9.977 s]
[INFO] Spark Project Networking ... SUCCESS [
17.696 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [
 8.864 s]
[INFO] Spark Project Unsafe ... SUCCESS [
17.485 s]
[INFO] Spark Project Launcher . SUCCESS [
19.551 s]
[INFO] Spark Project Core . FAILURE [01:19
min]
[INFO] Spark Project GraphX ... SKIPPED
[INFO] Spark Project Streaming  SKIPPED
[INFO] Spark Project Catalyst . SKIPPED
[INFO] Spark Project SQL .. SKIPPED
[INFO] Spark Project ML Local Library . SUCCESS [
19.594 s]
[INFO] Spark Project ML Library ... SKIPPED
[INFO] Spark Project Tools  SUCCESS [
 6.972 s]
[INFO] Spark Project Hive . SKIPPED
[INFO] Spark Project REPL . SKIPPED
[INFO] Spark Project YARN Shuffle Service . SUCCESS [
12.019 s]
[INFO] Spark Project YARN . SKIPPED
[INFO] Spark Project Assembly . SKIPPED
[INFO] Spark Project External Flume Sink .. SUCCESS [
13.460 s]
[INFO] Spark Project External Flume ... SKIPPED
[INFO] Spark Project External Flume Assembly .. SKIPPED
[INFO] Spark Integration for Kafka 0.8  SKIPPED
[INFO] Spark Project Examples . SKIPPED
[INFO] Spark Project External Kafka Assembly .. SKIPPED
[INFO] Spark Integration for Kafka 0.10 ... SKIPPED
[INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED
[INFO]

[INFO] BUILD FAILURE
[INFO]

[INFO] Total time: 02:08 min (Wall Clock)
[INFO] Finished at: 2016-08-03T02:13:07+02:00
[INFO] Final Memory: 54M/844M
[INFO]

[ERROR] Failed to execute goal
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first)
on project spark-core_2.11: Execution scala-compile-first of goal
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed
-> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions,
please read the following articles:
[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the
command
[ERROR]   mvn  -rf :spark-core_2.11

On Thu, Aug 4, 2016 at 6:30 PM, Sean Owen  wrote:

> That message is a warning, not error. It is just because you're cross
> compiling with Java 8. If something failed it was elsewhere.
>
>
> On Thu, 

Re: Spark 2.0 - make-distribution fails while regular build succeeded

2016-08-04 Thread Sean Owen
That message is a warning, not error. It is just because you're cross
compiling with Java 8. If something failed it was elsewhere.

On Thu, Aug 4, 2016, 07:09 Richard Siebeling  wrote:

> Hi,
>
> spark 2.0 with mapr hadoop libraries was succesfully build using the
> following command:
> ./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0-mapr-1602
> -DskipTests clean package
>
> However when I then try to build a runnable distribution using the
> following command
> ./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.7
> -Dhadoop.version=2.7.0-mapr-1602
>
> It fails with the error "bootstrap class path not set in conjunction with
> -source 1.7"
> Could you please help? I do not know what this error means,
>
> thanks in advance,
> Richard
>
>
>


Re: registering udf to use in spark.sql('select...

2016-08-04 Thread Nicholas Chammas
No, SQLContext is not disappearing. The top-level class is replaced by
SparkSession, but you can always get the underlying context from the
session.

You can also use SparkSession.udf.register(), which is just a wrapper for
sqlContext.registerFunction.

On Thu, Aug 4, 2016 at 12:04 PM Ben Teeuwen  wrote:

> Yes, but I don’t want to use it in a select() call.
> Either selectExpr() or spark.sql(), with the udf being called inside a
> string.
>
> Now I got it to work using
> "sqlContext.registerFunction('encodeOneHot_udf',encodeOneHot, VectorUDT())”
> But this sqlContext approach will disappear, right? So I’m curious what to
> use instead.
>
> On Aug 4, 2016, at 3:54 PM, Nicholas Chammas 
> wrote:
>
> Have you looked at pyspark.sql.functions.udf and the associated examples?
> On Thu, Aug 4, 2016 at 9:10 AM, Ben Teeuwen wrote:
>
>> Hi,
>>
>> I’d like to use a UDF in pyspark 2.0. As in ..
>> 
>>
>> def squareIt(x):
>>   return x * x
>>
>> # register the function and define return type
>> ….
>>
>> spark.sql(“”"select myUdf(adgroupid, 'extra_string_parameter') as
>> function_result from df’)
>>
>> _
>>
>> How can I register the function? I only see registerFunction in the
>> deprecated sqlContext at
>> http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html.
>> As the ‘spark’ object unifies hiveContext and sqlContext, what is the new
>> way to go?
>>
>> Ben
>>
>
>
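
For reference, a minimal PySpark 2.0 sketch of the SparkSession-based registration
(the function, return type, and temp view name are illustrative):

from pyspark.sql.types import IntegerType

def square_it(x):
    return x * x

# register on the session so the UDF can be used inside spark.sql(...)
spark.udf.register("squareIt", square_it, IntegerType())

df.createOrReplaceTempView("df_view")  # df is a hypothetical DataFrame
spark.sql("SELECT squareIt(adgroupid) AS function_result FROM df_view")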


Re: registering udf to use in spark.sql('select...

2016-08-04 Thread Ben Teeuwen
Yes, but I don’t want to use it in a select() call. 
Either selectExpr() or spark.sql(), with the udf being called inside a string.

Now I got it to work using 
"sqlContext.registerFunction('encodeOneHot_udf',encodeOneHot, VectorUDT())”
But this sqlContext approach will disappear, right? So I’m curious what to use 
instead.

> On Aug 4, 2016, at 3:54 PM, Nicholas Chammas  
> wrote:
> 
> Have you looked at pyspark.sql.functions.udf and the associated examples?
> On Thu, Aug 4, 2016 at 9:10 AM, Ben Teeuwen wrote:
> Hi,
> 
> I’d like to use a UDF in pyspark 2.0. As in ..
>  
> 
> def squareIt(x):
>   return x * x
> 
> # register the function and define return type
> ….
> 
> spark.sql(“”"select myUdf(adgroupid, 'extra_string_parameter') as 
> function_result from df’)
> 
> _
> 
> How can I register the function? I only see registerFunction in the 
> deprecated sqlContext at 
> http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html 
> .
> As the ‘spark’ object unifies hiveContext and sqlContext, what is the new way 
> to go?
> 
> Ben



Re: Spark SQL and number of task

2016-08-04 Thread Marco Colombo
Thanks a lot. That was my suspicion.
What is puzzling me is that also in the case of OR the pushdown is present in
the explain plan from Hive, while effectively it is not performed by the
client.
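
A quick way to see which predicates actually reach the source is to compare the
two plans directly (a sketch; the id values are just the ones from this thread):

df.filter(df.id.isin(94, 2)).explain(True)             # In(id, ...) under PushedFilters
df.filter((df.id == 94) | (df.id == 2)).explain(True)  # Spark-side Filter plus Or(...) pushed filter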

Regards

2016-08-04 15:37 GMT+02:00 Yong Zhang :

> The 2 plans look similar, but they are big difference, if you also
> consider that your source is in fact from a no-sql DB, like C*.
>
>
> The OR plan has "Filter ((id#0L = 94) || (id#0L = 2))", which means the
> filter is indeed happening on the Spark side, instead of on the C* side. Which
> means that to fulfill your query, Spark has to load all the data back from C*
> (imagine you have millions of IDs), and filter most of them out, and only keep data
> with id 94 and 2. The IO is the bottleneck in this case, and a huge amount of data
> needs to transfer from C* to Spark.
>
>
> In the other case, the ids being pushed down to C* (and in most case, the
> id is the primary key (or at least partition key)), so C* will find the
> data for these 2 ids very fast, and only return the matching data back to
> Spark, then doing the aggregation based on very small data in Spark. That
> is why your performance is big difference in these 2 cases.
>
>
> You can argue that Spark-Cassandra connector should be smarter to handle
> the "OR" case. But in general, OR is not  easy to handle, as in most cases,
> "OR" will be applied on different columns, instead of only on IDs in this
> case.
>
>
> If your query will use partition keys in C*, always use them with either
> "=" or "in". If not, then you have to wait for the data transfer from C* to
> spark. Spark + C* allow to run any ad-hoc queries, but you need to know the
> underline price paid.
>
>
> Yong
>
>
> --
> *From:* Takeshi Yamamuro 
> *Sent:* Thursday, August 4, 2016 8:18 AM
> *To:* Marco Colombo
> *Cc:* user
> *Subject:* Re: Spark SQL and number of task
>
> Seems the performance difference comes from `CassandraSourceRelation`.
> I'm not familiar with the implementation though, I guess the filter `IN`
> is pushed down
> into the datasource and the other is not.
>
> You'd better off checking performance metrics in webUI.
>
> // maropu
>
> On Thu, Aug 4, 2016 at 8:41 PM, Marco Colombo  > wrote:
>
>> Ok, thanx.
>> The 2 plan are very similar
>>
>> with in condition
>> +---
>> 
>> ---+--+
>> |
>> plan   |
>> +---
>> 
>> ---+--+
>> | == Physical Plan ==
>>  |
>> | TungstenAggregate(key=[id#0L], 
>> functions=[(avg(avg#2),mode=Final,isDistinct=false)],
>> output=[id#0L,_c1#81])  |
>> | +- TungstenExchange hashpartitioning(id#0L,10), None
>>   |
>> |+- TungstenAggregate(key=[id#0L], 
>> functions=[(avg(avg#2),mode=Partial,isDistinct=false)],
>> output=[id#0L,sum#85,count#86L])|
>> |   +- Scan 
>> org.apache.spark.sql.cassandra.CassandraSourceRelation@49243f65[id#0L,avg#2]
>> PushedFilters: [In(id,[Ljava.lang.Object;@6f8b2e9d)]  |
>> +---
>> 
>> ---+--+
>>
>> with the or condition
>> +---
>> 
>> ---+--+
>> |
>> plan   |
>> +---
>> 
>> ---+--+
>> | == Physical Plan ==
>>  |
>> | TungstenAggregate(key=[id#0L], 
>> functions=[(avg(avg#2),mode=Final,isDistinct=false)],
>> output=[id#0L,_c1#88])  |
>> | +- TungstenExchange hashpartitioning(id#0L,10), None
>>   |
>> |+- TungstenAggregate(key=[id#0L], 
>> functions=[(avg(avg#2),mode=Partial,isDistinct=false)],
>> output=[id#0L,sum#92,count#93L])|
>> |   +- Filter ((id#0L = 94) || (id#0L = 2))
>>  |
>> |  +- Scan org.apache.spark.sql.cassandra.
>> CassandraSourceRelation@49243f65[id#0L,avg#2] PushedFilters:
>> [Or(EqualTo(id,94),EqualTo(id,2))]  |
>> +---
>> 

Re: Spark jobs failing due to java.lang.OutOfMemoryError: PermGen space

2016-08-04 Thread Deepak Sharma
Yes, agreed. It seems to be an issue with mapping the text file contents to case
classes, though I'm not sure.

On Thu, Aug 4, 2016 at 8:17 PM, $iddhe$h Divekar  wrote:

> Hi Deepak,
>
> My files are always > 50MB.
> I would think there would be a small config to overcome this.
> Tried almost everything I could after searching online.
>
> Any help from the mailing list would be appreciated.
>
> On Thu, Aug 4, 2016 at 7:43 AM, Deepak Sharma 
> wrote:
>
>> I am facing the same issue with Spark 1.5.2.
>> If the file being processed by Spark is 10-12 MB in size,
>> it throws out of memory.
>> But if the same file is within a 5 MB limit, it runs fine.
>> I am using a Spark configuration with 7GB of memory and 3 cores for
>> executors in a cluster of 8 executors.
>>
>> Thanks
>> Deepak
>>
>> On 4 Aug 2016 8:04 pm, "$iddhe$h Divekar" 
>> wrote:
>>
>>> Hi,
>>>
>>> I am running spark jobs using apache oozie in yarn-client mode.
>>> My job.properties has sparkConf which gets used in workflow.xml.
>>>
>>> I have tried increasing MaxPermSize using sparkConf in job.properties
>>> but that is not resolving the issue.
>>>
>>> *sparkConf*=--verbose --driver-java-options '-XX:MaxPermSize=8192M'
>>> --conf spark.speculation=false --conf spark.hadoop.spark.sql.
>>> parquet.output.committer.class=\
>>> "org.apache.spark.sql.parquet.DirectParquetOutputCommitter" --conf
>>> spark.hadoop.mapred.output.committer.class="org.apache.hadoop.mapred.
>>> DirectFileOutputCommit\
>>> ter.class" --conf spark.hadoop.mapreduce.use.
>>> directfileoutputcommitter=true
>>>
>>> Am I missing anything ?
>>>
>>> I am seeing following errors.
>>>
>>> 2016-08-03 22:33:43,318  WARN SparkActionExecutor:523 -
>>> SERVER[ip-10-0-0-161.ec2.internal] USER[hadoop] GROUP[-] TOKEN[]
>>> APP[ApprouteOozie] JOB[031-160803180548580-oozie-oozi-W]
>>> ACTION[031-160803180548580-oozie-oozi-W@spark-approute] Launcher
>>> ERROR, reason: Main class [org.apache.oozie.action.hadoop.SparkMain],
>>> main() threw exception, PermGen space
>>> 2016-08-03 22:33:43,319  WARN SparkActionExecutor:523 -
>>> SERVER[ip-10-0-0-161.ec2.internal] USER[hadoop] GROUP[-] TOKEN[]
>>> APP[ApprouteOozie] JOB[031-160803180548580-oozie-oozi-W]
>>> ACTION[031-160803180548580-oozie-oozi-W@spark-approute] Launcher
>>> exception: PermGen space
>>> java.lang.OutOfMemoryError: PermGen space
>>>
>>> oozie-oozi-W@spark-approute] Launcher exception: PermGen space
>>> java.lang.OutOfMemoryError: PermGen space
>>> at java.lang.Class.getDeclaredConstructors0(Native Method)
>>> at java.lang.Class.privateGetDeclaredConstructors(Class.java:2595)
>>> at java.lang.Class.getConstructor0(Class.java:2895)
>>> at java.lang.Class.newInstance(Class.java:354)
>>> at sun.reflect.MethodAccessorGenerator$1.run(
>>> MethodAccessorGenerator.java:399)
>>> at sun.reflect.MethodAccessorGenerator$1.run(
>>> MethodAccessorGenerator.java:396)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at sun.reflect.MethodAccessorGenerator.generate(
>>> MethodAccessorGenerator.java:395)
>>> at sun.reflect.MethodAccessorGenerator.
>>> generateSerializationConstructor(MethodAccessorGenerator.java:113)
>>> at sun.reflect.ReflectionFactory.newConstructorForSerialization
>>> (ReflectionFactory.java:331)
>>> at java.io.ObjectStreamClass.getSerializableConstructor(
>>> ObjectStreamClass.java:1420)
>>> at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
>>> at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:497)
>>> at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:472)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:472)
>>> at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:369)
>>> at java.io.ObjectOutputStream.writeObject0(
>>> ObjectOutputStream.java:1133)
>>> at java.io.ObjectOutputStream.defaultWriteFields(
>>> ObjectOutputStream.java:1547)
>>> at java.io.ObjectOutputStream.writeSerialData(
>>> ObjectOutputStream.java:1508)
>>> at java.io.ObjectOutputStream.writeOrdinaryObject(
>>> ObjectOutputStream.java:1431)
>>> at java.io.ObjectOutputStream.writeObject0(
>>> ObjectOutputStream.java:1177)
>>> at java.io.ObjectOutputStream.defaultWriteFields(
>>> ObjectOutputStream.java:1547)
>>> at java.io.ObjectOutputStream.writeSerialData(
>>> ObjectOutputStream.java:1508)
>>> at java.io.ObjectOutputStream.writeOrdinaryObject(
>>> ObjectOutputStream.java:1431)
>>> at java.io.ObjectOutputStream.writeObject0(
>>> ObjectOutputStream.java:1177)
>>> at java.io.ObjectOutputStream.defaultWriteFields(
>>> ObjectOutputStream.java:1547)
>>> at java.io.ObjectOutputStream.writeSerialData(
>>> ObjectOutputStream.java:1508)
>>> at java.io.ObjectOutputStream.writeOrdinaryObject(
>>> ObjectOutputStream.java:1431)
>>>

Re: Spark jobs failing due to java.lang.OutOfMemoryError: PermGen space

2016-08-04 Thread $iddhe$h Divekar
Hi Deepak,

My files are always > 50MB.
I would think there would be a small config setting to overcome this.
I have tried almost everything I could find after searching online.

Any help from the mailing list would be appreciated.

On Thu, Aug 4, 2016 at 7:43 AM, Deepak Sharma  wrote:

> I am facing the same issue with spark 1.5.2
> If the file size that's being processed by spark , is of size 10-12 MB ,
> it throws out of memory .
> But if the same file is within 5 MB limit , it runs fine.
> I am using spark configuration with 7GB of memory and 3 cores for
> executors in the cluster of 8 executor.
>
> Thanks
> Deepak
>
> On 4 Aug 2016 8:04 pm, "$iddhe$h Divekar" 
> wrote:
>
>> Hi,
>>
>> I am running spark jobs using apache oozie in yarn-client mode.
>> My job.properties has sparkConf which gets used in workflow.xml.
>>
>> I have tried increasing MaxPermSize using sparkConf in job.properties
>> but that is not resolving the issue.
>>
>> *sparkConf*=--verbose --driver-java-options '-XX:MaxPermSize=8192M'
>> --conf spark.speculation=false --conf
>> spark.hadoop.spark.sql.parquet.output.committer.class=\
>> "org.apache.spark.sql.parquet.DirectParquetOutputCommitter" --conf
>> spark.hadoop.mapred.output.committer.class="org.apache.hadoop.mapred.DirectFileOutputCommit\
>> ter.class" --conf
>> spark.hadoop.mapreduce.use.directfileoutputcommitter=true
>>
>> Am I missing anything ?
>>
>> I am seeing following errors.
>>
>> 2016-08-03 22:33:43,318  WARN SparkActionExecutor:523 -
>> SERVER[ip-10-0-0-161.ec2.internal] USER[hadoop] GROUP[-] TOKEN[]
>> APP[ApprouteOozie] JOB[031-160803180548580-oozie-oozi-W]
>> ACTION[031-160803180548580-oozie-oozi-W@spark-approute] Launcher
>> ERROR, reason: Main class [org.apache.oozie.action.hadoop.SparkMain],
>> main() threw exception, PermGen space
>> 2016-08-03 22:33:43,319  WARN SparkActionExecutor:523 -
>> SERVER[ip-10-0-0-161.ec2.internal] USER[hadoop] GROUP[-] TOKEN[]
>> APP[ApprouteOozie] JOB[031-160803180548580-oozie-oozi-W]
>> ACTION[031-160803180548580-oozie-oozi-W@spark-approute] Launcher
>> exception: PermGen space
>> java.lang.OutOfMemoryError: PermGen space
>>
>> oozie-oozi-W@spark-approute] Launcher exception: PermGen space
>> java.lang.OutOfMemoryError: PermGen space
>> at java.lang.Class.getDeclaredConstructors0(Native Method)
>> at java.lang.Class.privateGetDeclaredConstructors(Class.java:2595)
>> at java.lang.Class.getConstructor0(Class.java:2895)
>> at java.lang.Class.newInstance(Class.java:354)
>> at
>> sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:399)
>> at
>> sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:396)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at
>> sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:395)
>> at
>> sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:113)
>> at
>> sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:331)
>> at
>> java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1420)
>> at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
>> at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:497)
>> at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:472)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:472)
>> at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:369)
>> at
>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1133)
>> at
>> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>> at
>> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>> at
>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>> at
>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>> at
>> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>> at
>> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>> at
>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>> at
>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>> at
>> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>> at
>> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>> at
>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>> at
>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
>> at
>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
>>
>> --
>> -$iddhi.
>>
>


-- 
-$iddhi.


Re: num-executors, executor-memory and executor-cores parameters

2016-08-04 Thread Mich Talebzadeh
This is a classic minefield of differing explanations. Here we go, this is mine.

Local mode

In this mode the driver program (SparkSubmit), the resource manager and
executor all exist within the same JVM. The JVM itself is the worker
thread. All local mode jobs run independently. There is no resource policing


num-executors   --> 1

executor-memory --> You can give as much as you can afford.

executor-cores  --> will go and grab what you have specified in --master
local[n]


Standalone mode
Resources are managed by the Spark resource manager itself. You start your
master and slave/worker processes. As far as I have worked it out, the
following applies:

num-executors --> It does not care about this. The number of
executors will be the number of workers on each node
executor-memory   --> If you have set up SPARK_WORKER_MEMORY in
spark-env.sh, this will be the memory used by the executor
executor-cores--> If you have set up SPARK_WORKER_CORES in
spark-env.sh, this will be the number of cores used by each executor
SPARK_WORKER_CORES=n ##, total number of cores to be used by executors by
each worker
SPARK_WORKER_MEMORY=mg ##, to set how much total memory workers have to
give executors (e.g. 1000m, 2g)

Yarn mode
Yarn manages resources that by far is more robust than other modes
num-executors --> This is the upper limit on the total number of
executors that you can have across the cluster, not just one node.
Yarn will decide on
the limit if it is achievable
executor-memory   --> memory allocated to each executor. Again if there
is indeed memory available
executor-cores--> this is the number of cores allocated to each
executor. To give an example a given executor can run the same code
  on a subset of data in parallel using
executor-cores tasks

It is my understanding that Yarn will divide the number of executors
uniformly across the cluster. Balancing resource management is an important
issue in Spark. Yarn does a good job, but no resource manager can stop someone
specifying unrealistic parameters. The job won't start or will crash.
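
As an illustration only, here is a minimal sketch of the same three knobs expressed as
standard Spark configuration properties for a YARN deployment; the values are
placeholders, not recommendations:

from pyspark.sql import SparkSession

# minimal sketch; values are placeholders
spark = (SparkSession.builder
         .master("yarn")
         .config("spark.executor.instances", "4")   # equivalent of --num-executors
         .config("spark.executor.memory", "4g")     # equivalent of --executor-memory
         .config("spark.executor.cores", "2")       # equivalent of --executor-cores
         .getOrCreate())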

Please correct me if any is wrong

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 4 August 2016 at 14:39, Ashok Kumar  wrote:

> Hi
>
> I would like to know the exact definition for these three  parameters
>
> num-executors
> executor-memory
> executor-cores
>
> for local, standalone and yarn modes
>
> I have looked at on-line doc but not convinced if I understand them
> correct.
>
> Thanking you
>


Re: Spark jobs failing due to java.lang.OutOfMemoryError: PermGen space

2016-08-04 Thread Deepak Sharma
I am facing the same issue with Spark 1.5.2.
If the file being processed by Spark is 10-12 MB in size, it throws out of memory.
But if the same file is within the 5 MB limit, it runs fine.
I am using a Spark configuration with 7GB of memory and 3 cores per executor,
in a cluster of 8 executors.

Thanks
Deepak

On 4 Aug 2016 8:04 pm, "$iddhe$h Divekar" 
wrote:

> Hi,
>
> I am running spark jobs using apache oozie in yarn-client mode.
> My job.properties has sparkConf which gets used in workflow.xml.
>
> I have tried increasing MaxPermSize using sparkConf in job.properties
> but that is not resolving the issue.
>
> *sparkConf*=--verbose --driver-java-options '-XX:MaxPermSize=8192M'
> --conf spark.speculation=false --conf spark.hadoop.spark.sql.
> parquet.output.committer.class=\
> "org.apache.spark.sql.parquet.DirectParquetOutputCommitter" --conf
> spark.hadoop.mapred.output.committer.class="org.apache.hadoop.mapred.
> DirectFileOutputCommit\
> ter.class" --conf spark.hadoop.mapreduce.use.
> directfileoutputcommitter=true
>
> Am I missing anything ?
>
> I am seeing following errors.
>
> 2016-08-03 22:33:43,318  WARN SparkActionExecutor:523 -
> SERVER[ip-10-0-0-161.ec2.internal] USER[hadoop] GROUP[-] TOKEN[]
> APP[ApprouteOozie] JOB[031-160803180548580-oozie-oozi-W]
> ACTION[031-160803180548580-oozie-oozi-W@spark-approute] Launcher
> ERROR, reason: Main class [org.apache.oozie.action.hadoop.SparkMain],
> main() threw exception, PermGen space
> 2016-08-03 22:33:43,319  WARN SparkActionExecutor:523 -
> SERVER[ip-10-0-0-161.ec2.internal] USER[hadoop] GROUP[-] TOKEN[]
> APP[ApprouteOozie] JOB[031-160803180548580-oozie-oozi-W]
> ACTION[031-160803180548580-oozie-oozi-W@spark-approute] Launcher
> exception: PermGen space
> java.lang.OutOfMemoryError: PermGen space
>
> oozie-oozi-W@spark-approute] Launcher exception: PermGen space
> java.lang.OutOfMemoryError: PermGen space
> at java.lang.Class.getDeclaredConstructors0(Native Method)
> at java.lang.Class.privateGetDeclaredConstructors(Class.java:2595)
> at java.lang.Class.getConstructor0(Class.java:2895)
> at java.lang.Class.newInstance(Class.java:354)
> at sun.reflect.MethodAccessorGenerator$1.run(
> MethodAccessorGenerator.java:399)
> at sun.reflect.MethodAccessorGenerator$1.run(
> MethodAccessorGenerator.java:396)
> at java.security.AccessController.doPrivileged(Native Method)
> at sun.reflect.MethodAccessorGenerator.generate(
> MethodAccessorGenerator.java:395)
> at sun.reflect.MethodAccessorGenerator.generateSerializationConstruct
> or(MethodAccessorGenerator.java:113)
> at sun.reflect.ReflectionFactory.newConstructorForSerialization
> (ReflectionFactory.java:331)
> at java.io.ObjectStreamClass.getSerializableConstructor(
> ObjectStreamClass.java:1420)
> at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
> at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:497)
> at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:472)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:472)
> at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:369)
> at java.io.ObjectOutputStream.writeObject0(
> ObjectOutputStream.java:1133)
> at java.io.ObjectOutputStream.defaultWriteFields(
> ObjectOutputStream.java:1547)
> at java.io.ObjectOutputStream.writeSerialData(
> ObjectOutputStream.java:1508)
> at java.io.ObjectOutputStream.writeOrdinaryObject(
> ObjectOutputStream.java:1431)
> at java.io.ObjectOutputStream.writeObject0(
> ObjectOutputStream.java:1177)
> at java.io.ObjectOutputStream.defaultWriteFields(
> ObjectOutputStream.java:1547)
> at java.io.ObjectOutputStream.writeSerialData(
> ObjectOutputStream.java:1508)
> at java.io.ObjectOutputStream.writeOrdinaryObject(
> ObjectOutputStream.java:1431)
> at java.io.ObjectOutputStream.writeObject0(
> ObjectOutputStream.java:1177)
> at java.io.ObjectOutputStream.defaultWriteFields(
> ObjectOutputStream.java:1547)
> at java.io.ObjectOutputStream.writeSerialData(
> ObjectOutputStream.java:1508)
> at java.io.ObjectOutputStream.writeOrdinaryObject(
> ObjectOutputStream.java:1431)
> at java.io.ObjectOutputStream.writeObject0(
> ObjectOutputStream.java:1177)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
> at java.io.ObjectOutputStream.writeObject0(
> ObjectOutputStream.java:1173)
>
> --
> -$iddhi.
>


Re: how to run local[k] threads on a single core

2016-08-04 Thread Daniel Darabos
You could run the application in a Docker container constrained to one CPU
with --cpuset-cpus (
https://docs.docker.com/engine/reference/run/#/cpuset-constraint).

On Thu, Aug 4, 2016 at 8:51 AM, Sun Rui  wrote:

> I don’t think it possible as Spark does not support thread to CPU affinity.
> > On Aug 4, 2016, at 14:27, sujeet jog  wrote:
> >
> > Is there a way we can run multiple tasks concurrently on a single core
> in local mode.
> >
> > for ex :- i have 5 partition ~ 5 tasks, and only a single core , i want
> these tasks to run concurrently, and specifiy them to use /run on a single
> core.
> >
> > The machine itself is say 4 core, but i want to utilize only 1 core out
> of it,.
> >
> > Is it possible ?
> >
> > Thanks,
> > Sujeet
> >
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Raleigh, Durham, and around...

2016-08-04 Thread Jean Georges Perrin
Hi,

With some friends, we try to develop the Apache Spark community in the Triangle 
area of North Carolina, USA. If you are from there, feel free to join our Slack 
team: http://oplo.io/td. Danny Siegle has also organized a lot of meet ups 
around the edX courses (see https://www.meetup.com/Triangle-Apache-Spark-Meetup/).

Sorry for the folks not in NC, but you should come over, it's a great place to 
live :)

jg




Spark jobs failing due to java.lang.OutOfMemoryError: PermGen space

2016-08-04 Thread $iddhe$h Divekar
Hi,

I am running spark jobs using apache oozie in yarn-client mode.
My job.properties has sparkConf which gets used in workflow.xml.

I have tried increasing MaxPermSize using sparkConf in job.properties
but that is not resolving the issue.

*sparkConf*=--verbose --driver-java-options '-XX:MaxPermSize=8192M' --conf
spark.speculation=false --conf
spark.hadoop.spark.sql.parquet.output.committer.class=\
"org.apache.spark.sql.parquet.DirectParquetOutputCommitter" --conf
spark.hadoop.mapred.output.committer.class="org.apache.hadoop.mapred.DirectFileOutputCommit\
ter.class" --conf spark.hadoop.mapreduce.use.directfileoutputcommitter=true

Am I missing anything ?
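
For comparison, here is a minimal sketch of how MaxPermSize could be passed through
standard Spark properties so that it reaches the executor and YARN application-master
JVMs as well as the driver; whether that addresses this particular launcher failure is
not certain, and the sizes below are placeholders:

from pyspark import SparkConf

# minimal sketch; sizes are placeholders
conf = (SparkConf()
        .set("spark.driver.extraJavaOptions", "-XX:MaxPermSize=512m")
        .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=512m")
        .set("spark.yarn.am.extraJavaOptions", "-XX:MaxPermSize=512m"))  # used in yarn-client mode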

I am seeing following errors.

2016-08-03 22:33:43,318  WARN SparkActionExecutor:523 -
SERVER[ip-10-0-0-161.ec2.internal] USER[hadoop] GROUP[-] TOKEN[]
APP[ApprouteOozie] JOB[031-160803180548580-oozie-oozi-W]
ACTION[031-160803180548580-oozie-oozi-W@spark-approute] Launcher ERROR,
reason: Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw
exception, PermGen space
2016-08-03 22:33:43,319  WARN SparkActionExecutor:523 -
SERVER[ip-10-0-0-161.ec2.internal] USER[hadoop] GROUP[-] TOKEN[]
APP[ApprouteOozie] JOB[031-160803180548580-oozie-oozi-W]
ACTION[031-160803180548580-oozie-oozi-W@spark-approute] Launcher
exception: PermGen space
java.lang.OutOfMemoryError: PermGen space

oozie-oozi-W@spark-approute] Launcher exception: PermGen space
java.lang.OutOfMemoryError: PermGen space
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2595)
at java.lang.Class.getConstructor0(Class.java:2895)
at java.lang.Class.newInstance(Class.java:354)
at
sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:399)
at
sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:396)
at java.security.AccessController.doPrivileged(Native Method)
at
sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:395)
at
sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:113)
at
sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:331)
at
java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1420)
at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:497)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:472)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:472)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:369)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1133)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)

-- 
-$iddhi.


WindowsError: [Error 2] The system cannot find the file specified

2016-08-04 Thread pseudo oduesp
Hi,
with PySpark 2.0 I get this error:

WindowsError: [Error 2] The system cannot find the file specified

Can someone help me find a solution?
Thanks


Re: WindowsError: [Error 2] The system cannot find the file specified

2016-08-04 Thread pseudo oduesp
C:\Users\AppData\Local\Continuum\Anaconda2\python.exe
C:/workspacecode/pyspark/pyspark/churn/test.py
Traceback (most recent call last):
  File "C:/workspacecode/pyspark/pyspark/churn/test.py", line 5, in 
conf = SparkConf()
  File
"C:\spark-2.0.0-bin-hadoop2.6\spark-2.0.0-bin-hadoop2.6\spark-2.0.0-bin-hadoop2.6\python\pyspark\conf.py",
line 104, in __init__
SparkContext._ensure_initialized()
  File
"C:\spark-2.0.0-bin-hadoop2.6\spark-2.0.0-bin-hadoop2.6\spark-2.0.0-bin-hadoop2.6\python\pyspark\context.py",
line 243, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
  File
"C:\spark-2.0.0-bin-hadoop2.6\spark-2.0.0-bin-hadoop2.6\spark-2.0.0-bin-hadoop2.6\python\pyspark\java_gateway.py",
line 79, in launch_gateway
proc = Popen(command, stdin=PIPE, env=env)
  File "C:\Users\AppData\Local\Continuum\Anaconda2\lib\subprocess.py", line
711, in __init__
errread, errwrite)
  File "C:\Users\AppData\Local\Continuum\Anaconda2\lib\subprocess.py", line
959, in _execute_child
startupinfo)
WindowsError: [Error 2] Le fichier spécifié est introuvable (the system cannot find the file specified)

Process finished with exit code 1
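
The traceback shows the Py4J gateway failing in Popen while trying to launch Spark's
submit script, so a common cause is that the environment does not point at the Spark
distribution. A minimal sketch of one thing to check; the paths below are assumptions
based on the directories appearing in the traceback:

import os

# assumption: SPARK_HOME must point at the extracted Spark distribution so that
# python\pyspark\java_gateway.py can find bin\spark-submit(.cmd)
os.environ["SPARK_HOME"] = r"C:\spark-2.0.0-bin-hadoop2.6\spark-2.0.0-bin-hadoop2.6\spark-2.0.0-bin-hadoop2.6"
# assumption: on Windows, a HADOOP_HOME\bin containing winutils.exe is also commonly required
os.environ.setdefault("HADOOP_HOME", r"C:\hadoop")

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("test")
sc = SparkContext(conf=conf)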

2016-08-04 16:01 GMT+02:00 pseudo oduesp :

> hi ,
> with pyspark 2.0  i get this errors
>
> WindowsError: [Error 2] The system cannot find the file specified
>
> someone can help me to find solution
> thanks
>
>


Re: Add column sum as new column in PySpark dataframe

2016-08-04 Thread Mich Talebzadeh
Sorry, do you want the sum for each row or the sum for each column?

Assuming all the values are numeric.

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 4 August 2016 at 14:41, Javier Rey  wrote:

> Hi everybody,
>
> Sorry, I sent last mesage it was imcomplete this is complete:
>
> I'm using PySpark and I have a Spark dataframe with a bunch of numeric
> columns. I want to add a column that is the sum of all the other columns.
>
> Suppose my dataframe had columns "a", "b", and "c". I know I can do this:
>
> df.withColumn('total_col', df.a + df.b + df.c)
>
> The problem is that I don't want to type out each column individually and
> add them, especially if I have a lot of columns. I want to be able to do
> this automatically or by specifying a list of column names that I want to
> add. Is there another way to do this?
>
> I find this solution:
>
> df.withColumn('total', sum(df[col] for col in df.columns))
>
> But I get this error:
>
> "AttributeError: 'generator' object has no attribute '_get_object_id"
>
> Additionally I want to sum onlt not nulls values.
>
> Thanks in advance,
>
> Samir
>
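
A minimal sketch of a row-wise sum that also skips nulls, assuming the column list is
supplied explicitly (the names below are only examples):

from functools import reduce
from pyspark.sql.functions import col, coalesce, lit

cols_to_sum = ["a", "b", "c"]  # or e.g. [c for c in df.columns if c not in ("id",)]

# coalesce(null, 0) makes a null behave as 0, so null values are effectively skipped
row_total = reduce(lambda x, y: x + y,
                   [coalesce(col(c), lit(0)) for c in cols_to_sum])

df2 = df.withColumn("total_col", row_total)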


Re: registering udf to use in spark.sql('select...

2016-08-04 Thread Nicholas Chammas
Have you looked at pyspark.sql.functions.udf and the associated examples?
On Thu, Aug 4, 2016 at 9:10 AM, Ben Teeuwen wrote:

> Hi,
>
> I’d like to use a UDF in pyspark 2.0. As in ..
> 
>
> def squareIt(x):
>   return x * x
>
> # register the function and define return type
> ….
>
> spark.sql(“”"select myUdf(adgroupid, 'extra_string_parameter') as
> function_result from df’)
>
> _
>
> How can I register the function? I only see registerFunction in the
> deprecated sqlContext at
> http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html.
> As the ‘spark’ object unifies hiveContext and sqlContext, what is the new
> way to go?
>
> Ben
>
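
For illustration, a minimal sketch of both styles in PySpark 2.0, under the assumption
that spark.udf.register is the SparkSession counterpart of the old
sqlContext.registerFunction (the table and column names are placeholders):

from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

def squareIt(x):
    return x * x

# style 1: register by name so it can be called from spark.sql(...) strings
spark.udf.register("squareIt", squareIt, IntegerType())   # assumption: available on SparkSession
df.createOrReplaceTempView("df_table")                    # the SQL string needs a view name
spark.sql("select squareIt(adgroupid) as function_result from df_table").show()

# style 2: wrap it as a column expression for the DataFrame API
square_udf = udf(squareIt, IntegerType())
df.withColumn("function_result", square_udf(col("adgroupid"))).show()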


Re: source code for org.spark-project.hive

2016-08-04 Thread Ted Yu
https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2

FYI

On Thu, Aug 4, 2016 at 6:23 AM, prabhat__  wrote:

> hey
> can anyone point me to the source code for the jars used with group-id
> org.spark-project.hive.
> This was previously maintained in the private repo of pwendell
> (https://github.com/pwendell/hive) which doesn't seem to be active now.
>
> where can i find the source code for group: org.spark-project.hive version:
> 1.2.1.spark2
>
>
> Thanks
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/source-code-for-org-spark-project-hive-tp27476.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: source code for org.spark-project.hive

2016-08-04 Thread Prabhat Kumar Gupta
Thanks a lot.

On Thu, Aug 4, 2016 at 7:16 PM, Ted Yu  wrote:

> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
>
> FYI
>
> On Thu, Aug 4, 2016 at 6:23 AM, prabhat__ 
> wrote:
>
>> hey
>> can anyone point me to the source code for the jars used with group-id
>> org.spark-project.hive.
>> This was previously maintained in the private repo of pwendell
>> (https://github.com/pwendell/hive) which doesn't seem to be active now.
>>
>> where can i find the source code for group: org.spark-project.hive
>> version:
>> 1.2.1.spark2
>>
>>
>> Thanks
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/source-code-for-org-spark-project-hive-tp27476.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


-- 
Prabhat Kumar Gupta
Ph.-9987776847


Add column sum as new column in PySpark dataframe

2016-08-04 Thread Javier Rey
Hi everybody,

Sorry, I sent the last message incomplete; this is the complete version:

I'm using PySpark and I have a Spark dataframe with a bunch of numeric
columns. I want to add a column that is the sum of all the other columns.

Suppose my dataframe had columns "a", "b", and "c". I know I can do this:

df.withColumn('total_col', df.a + df.b + df.c)

The problem is that I don't want to type out each column individually and
add them, especially if I have a lot of columns. I want to be able to do
this automatically or by specifying a list of column names that I want to
add. Is there another way to do this?

I find this solution:

df.withColumn('total', sum(df[col] for col in df.columns))

But I get this error:

"AttributeError: 'generator' object has no attribute '_get_object_id"

Additionally, I want to sum only non-null values.

Thanks in advance,

Samir


Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Nick Pentreath
Sure, I understand there are some issues with handling this missing value
situation in StringIndexer currently. Your workaround is not ideal but I
see that it is probably the only mechanism available currently to avoid the
problem.

But the OOM issues seem to be more about the feature cardinality (so the
size of the hashmap to store the feature <-> index mappings).

A nice property of feature hashing is that it implicitly handles unseen
category labels by setting the coefficient value to 0 (in the absence of a
hash collision) - basically option 2 from H2O.

Why is that? Well once you've trained your model you have a (sparse)
N-dimensional weight vector that will be definition have 0s for unseen
indexes. At test time, any feature that only appears in your test set or
new data will be hashed to an index in the weight vector that has value 0.

So it could be useful for both of your problems.

On Thu, 4 Aug 2016 at 15:25 Ben Teeuwen  wrote:

> Hi Nick,
>
> Thanks for the suggestion. Reducing the dimensionality is an option,
> thanks, but let’s say I really want to do this :).
>
> The reason why it’s so big is that I’m unifying my training and test data,
> and I don’t want to drop rows in the test data just because one of the
> features was missing in the training data. I wouldn’t need this
>  workaround, if I had a better *strategy in Spark for dealing with
> missing levels. *How Spark can deal with it:
>
>
> *"Additionally, there are two strategies regarding how StringIndexer will
> handle unseen labels when you have fit aStringIndexer on one dataset and
> then use it to transform another:*
>
> * • throw an exception (which is the default)*
> * • skip the row containing the unseen label entirely"*
> http://spark.apache.org/docs/2.0.0/ml-features.html#stringindexer
>
> I like how *H2O* handles this;
>
> *"What happens during prediction if the new sample has categorical levels
> not seen in training? The value will be filled with either special
> missing level (if trained with missing values and missing_value_handling
> was set to MeanImputation) or 0.”*
>
> https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/tutorials/datascience/DataScienceH2O-Dev.md
>
> So assuming I need to unify the data, make it huge, and trying out more in
> scala, I see *these kinds of errors*:
> _
>
> scala> feedBack(s"Applying string indexers: fitting")
> 2016-08-04 10:13:20() | Applying string indexers: fitting
>
> scala> val pipelined = new Pipeline().setStages(stagesIndex.toArray)
> pipelined: org.apache.spark.ml.Pipeline = pipeline_83be3b554e3a
>
> scala> val dfFitted = pipelined.fit(df)
> dfFitted: org.apache.spark.ml.PipelineModel = pipeline_83be3b554e3a
>
> scala> feedBack(s"Applying string indexers: transforming")
> 2016-08-04 10:17:29() | Applying string indexers: transforming
>
> scala> var df2 = dfFitted.transform(df)
> df2: org.apache.spark.sql.DataFrame = [myid: string, feature1: int ... 16
> more fields]
>
> scala>
>
> scala> feedBack(s"Applying OHE: fitting")
> 2016-08-04 10:18:07() | Applying OHE: fitting
>
> scala> val pipelined2 = new Pipeline().setStages(stagesOhe.toArray)
> pipelined2: org.apache.spark.ml.Pipeline = pipeline_ba7922a29322
>
> scala> val dfFitted2 = pipelined2.fit(df2)
> 16/08/04 10:21:41 WARN DFSClient: Slow ReadProcessor read fields took
> 85735ms (threshold=3ms); ack: seqno: -2 status: SUCCESS status: ERROR
> downstreamAckTimeNanos: 0, targets: [10.10.66.13:50010, 10.10.95.11:50010,
> 10.10.95.29:50010]
> 16/08/04 10:21:41 WARN DFSClient: DFSOutputStream ResponseProcessor
> exception  for block
> BP-292564-10.196.101.2-1366289936494:blk_2802150425_1105993380377
> java.io.IOException: Bad response ERROR for block
> BP-292564-10.196.101.2-1366289936494:blk_2802150425_1105993380377 from
> datanode 10.10.95.11:50010
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:897)
> 16/08/04 10:21:41 WARN DFSClient: Error Recovery for block
> BP-292564-10.196.101.2-1366289936494:blk_2802150425_1105993380377 in
> pipeline 10.10.66.13:50010, 10.10.95.11:50010, 10.10.95.29:50010: bad
> datanode 10.10.95.11:50010
> dfFitted2: org.apache.spark.ml.PipelineModel = pipeline_ba7922a29322
>
> scala> feedBack(s"Applying OHE: transforming")
> 2016-08-04 10:29:12() | Applying OHE: transforming
>
> scala> df2 = dfFitted2.transform(df2).cache()
> 16/08/04 10:34:18 WARN DFSClient: DFSOutputStream ResponseProcessor
> exception  for block
> BP-292564-10.196.101.2-1366289936494:blk_2802150425_1105993414608
> java.io.EOFException: Premature EOF: no length prefix available
> at
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:867)
> 16/08/04 10:34:18 WARN 

num-executors, executor-memory and executor-cores parameters

2016-08-04 Thread Ashok Kumar
Hi
I would like to know the exact definition for these three  parameters 
num-executors
executor-memory
executor-cores

for local, standalone and yarn modes

I have looked at the online docs but am not convinced I understand them correctly.
Thanking you 

Add column sum as new column in PySpark dataframe

2016-08-04 Thread Javier Rey
I'm using PySpark and I have a Spark dataframe with a bunch of numeric
columns. I want to add a column that is the sum of all the other columns.

Suppose my dataframe had columns "a", "b", and "c". I know I can do this:


Re: Spark SQL and number of task

2016-08-04 Thread Yong Zhang
The 2 plans look similar, but there is a big difference, if you also consider
that your source is in fact a NoSQL DB, like C*.


The OR plan has "Filter ((id#0L = 94) || (id#0L = 2))", which means the filter 
is indeed happening on Spark side, instead of on C* side. Which means to 
fulfill your query, Spark has to load all the data back C* (Image your have 
millions of IDs), and filter most of them out, and only keep data with id 94 
and 2. The IO is bottleneck in this case, and huge data need to transfer from 
C* to spark.


In the other case, the ids are pushed down to C* (and in most cases, the id is the
primary key (or at least the partition key)), so C* will find the data for these
2 ids very fast and only return the matching data back to Spark, which then does the
aggregation on a very small amount of data. That is why the performance is so
different between these 2 cases.


You can argue that Spark-Cassandra connector should be smarter to handle the 
"OR" case. But in general, OR is not  easy to handle, as in most cases, "OR" 
will be applied on different columns, instead of only on IDs in this case.


If your query will use partition keys in C*, always use them with either "=" or 
"in". If not, then you have to wait for the data transfer from C* to spark. 
Spark + C* allow to run any ad-hoc queries, but you need to know the underline 
price paid.
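
A minimal sketch of how to confirm this on the thread's own query, by comparing the
physical plans (the expected output in the comments is paraphrased from the plans
posted in this thread):

# the IN form is pushed down to Cassandra as a single In(...) filter
sqlContext.sql(
    "select id, avg(avg) from v_points where id in (90, 2) group by id"
).explain()
# expect: ... PushedFilters: [In(id, ...)] with no Spark-side Filter on id

# the OR form keeps a Filter ((id = 90) || (id = 2)) on the Spark side
sqlContext.sql(
    "select id, avg(avg) from v_points where id = 90 or id = 2 group by id"
).explain()
# expect: ... +- Filter ((id#0L = 90) || (id#0L = 2)) above the Cassandra scan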


Yong



From: Takeshi Yamamuro 
Sent: Thursday, August 4, 2016 8:18 AM
To: Marco Colombo
Cc: user
Subject: Re: Spark SQL and number of task

Seems the performance difference comes from `CassandraSourceRelation`.
I'm not familiar with the implementation though, I guess the filter `IN` is 
pushed down
into the datasource and the other not.

You'd better off checking performance metrics in webUI.

// maropu

On Thu, Aug 4, 2016 at 8:41 PM, Marco Colombo 
> wrote:
Ok, thanx.
The 2 plan are very similar

with in condition

== Physical Plan ==
TungstenAggregate(key=[id#0L], functions=[(avg(avg#2),mode=Final,isDistinct=false)], output=[id#0L,_c1#81])
+- TungstenExchange hashpartitioning(id#0L,10), None
   +- TungstenAggregate(key=[id#0L], functions=[(avg(avg#2),mode=Partial,isDistinct=false)], output=[id#0L,sum#85,count#86L])
      +- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@49243f65[id#0L,avg#2] PushedFilters: [In(id,[Ljava.lang.Object;@6f8b2e9d)]

with the or condition

== Physical Plan ==
TungstenAggregate(key=[id#0L], functions=[(avg(avg#2),mode=Final,isDistinct=false)], output=[id#0L,_c1#88])
+- TungstenExchange hashpartitioning(id#0L,10), None
   +- TungstenAggregate(key=[id#0L], functions=[(avg(avg#2),mode=Partial,isDistinct=false)], output=[id#0L,sum#92,count#93L])
      +- Filter ((id#0L = 94) || (id#0L = 2))
         +- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@49243f65[id#0L,avg#2] PushedFilters: [Or(EqualTo(id,94),EqualTo(id,2))]


Filters are pushed down, so I cannot understand why it is performing such a big data extraction in the OR case.

source code for org.spark-project.hive

2016-08-04 Thread prabhat__
hey 
can anyone point me to the source code for the jars used with group-id
org.spark-project.hive.
This was previously maintained in the private repo of pwendell
(https://github.com/pwendell/hive) which doesn't seem to be active now.

where can i find the source code for group: org.spark-project.hive version:
1.2.1.spark2


Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/source-code-for-org-spark-project-hive-tp27476.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Ben Teeuwen
Hi Nick, 

Thanks for the suggestion. Reducing the dimensionality is an option, thanks, 
but let’s say I really want to do this :).

The reason why it’s so big is that I’m unifying my training and test data, and 
I don’t want to drop rows in the test data just because one of the features was 
missing in the training data. I wouldn’t need this  workaround, if I had a 
better strategy in Spark for dealing with missing levels. How Spark can deal 
with it:

"Additionally, there are two strategies regarding how StringIndexer will handle 
unseen labels when you have fit aStringIndexer on one dataset and then use it 
to transform another:
• throw an exception (which is the default)
• skip the row containing the unseen label entirely"
http://spark.apache.org/docs/2.0.0/ml-features.html#stringindexer 
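
For illustration, a minimal sketch of selecting the second strategy, assuming the
handleInvalid parameter that exposes these options is available on StringIndexer in
this PySpark version (the column and DataFrame names are placeholders):

from pyspark.ml.feature import StringIndexer

# "error" (the default) throws on unseen labels; "skip" drops those rows at transform time
indexer = StringIndexer(inputCol="feature1", outputCol="feature1Index", handleInvalid="skip")
model = indexer.fit(train_df)            # fit on the training data only
indexed_test = model.transform(test_df)  # rows whose label was unseen in training are skipped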
 

I like how H2O handles this; 

"What happens during prediction if the new sample has categorical levels not 
seen in training? The value will be filled with either special missing level 
(if trained with missing values and missing_value_handling was set to 
MeanImputation) or 0.”
https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/tutorials/datascience/DataScienceH2O-Dev.md
 


So, assuming I need to unify the data and make it huge, while trying out more in
Scala, I see these kinds of errors:
_

scala> feedBack(s"Applying string indexers: fitting")
2016-08-04 10:13:20() | Applying string indexers: fitting

scala> val pipelined = new Pipeline().setStages(stagesIndex.toArray)
pipelined: org.apache.spark.ml.Pipeline = pipeline_83be3b554e3a

scala> val dfFitted = pipelined.fit(df)
dfFitted: org.apache.spark.ml.PipelineModel = pipeline_83be3b554e3a

scala> feedBack(s"Applying string indexers: transforming")
2016-08-04 10:17:29() | Applying string indexers: transforming

scala> var df2 = dfFitted.transform(df)
df2: org.apache.spark.sql.DataFrame = [myid: string, feature1: int ... 16 more 
fields]

scala>

scala> feedBack(s"Applying OHE: fitting")
2016-08-04 10:18:07() | Applying OHE: fitting

scala> val pipelined2 = new Pipeline().setStages(stagesOhe.toArray)
pipelined2: org.apache.spark.ml.Pipeline = pipeline_ba7922a29322

scala> val dfFitted2 = pipelined2.fit(df2)
16/08/04 10:21:41 WARN DFSClient: Slow ReadProcessor read fields took 85735ms 
(threshold=3ms); ack: seqno: -2 status: SUCCESS status: ERROR 
downstreamAckTimeNanos: 0, targets: [10.10.66.13:50010, 10.10.95.11:50010, 
10.10.95.29:50010]
16/08/04 10:21:41 WARN DFSClient: DFSOutputStream ResponseProcessor exception  
for block BP-292564-10.196.101.2-1366289936494:blk_2802150425_1105993380377
java.io.IOException: Bad response ERROR for block 
BP-292564-10.196.101.2-1366289936494:blk_2802150425_1105993380377 from 
datanode 10.10.95.11:50010
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:897)
16/08/04 10:21:41 WARN DFSClient: Error Recovery for block 
BP-292564-10.196.101.2-1366289936494:blk_2802150425_1105993380377 in 
pipeline 10.10.66.13:50010, 10.10.95.11:50010, 10.10.95.29:50010: bad datanode 
10.10.95.11:50010
dfFitted2: org.apache.spark.ml.PipelineModel = pipeline_ba7922a29322

scala> feedBack(s"Applying OHE: transforming")
2016-08-04 10:29:12() | Applying OHE: transforming

scala> df2 = dfFitted2.transform(df2).cache()
16/08/04 10:34:18 WARN DFSClient: DFSOutputStream ResponseProcessor exception  
for block BP-292564-10.196.101.2-1366289936494:blk_2802150425_1105993414608
java.io.EOFException: Premature EOF: no length prefix available
at 
org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:867)
16/08/04 10:34:18 WARN DFSClient: Error Recovery for block 
BP-292564-10.196.101.2-1366289936494:blk_2802150425_1105993414608 in 
pipeline 10.10.66.13:50010, 10.10.66.3:50010, 10.10.95.29:50010: bad datanode 
10.10.66.13:50010
16/08/04 10:36:03 WARN DFSClient: Slow ReadProcessor read fields took 74146ms 
(threshold=3ms); ack: seqno: -2 status: SUCCESS status: SUCCESS status: 
ERROR downstreamAckTimeNanos: 0, targets: [10.10.66.3:50010, 10.10.66.1:50010, 
10.10.95.29:50010]
16/08/04 10:36:03 WARN DFSClient: DFSOutputStream ResponseProcessor exception  
for block BP-292564-10.196.101.2-1366289936494:blk_2802150425_1105993467488
java.io.IOException: Bad response ERROR for block 
BP-292564-10.196.101.2-1366289936494:blk_2802150425_1105993467488 from 
datanode 10.10.95.29:50010
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:897)
16/08/04 10:36:03 WARN DFSClient: Error 

registering udf to use in spark.sql('select...

2016-08-04 Thread Ben Teeuwen
Hi,

I’d like to use a UDF in pyspark 2.0. As in ..
 

def squareIt(x):
  return x * x

# register the function and define return type
….

spark.sql("""select myUdf(adgroupid, 'extra_string_parameter') as function_result from df""")

_

How can I register the function? I only see registerFunction in the deprecated 
sqlContext at http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html 
.
As the ‘spark’ object unifies hiveContext and sqlContext, what is the new way 
to go?

Ben

Using Spark 2.0 inside Docker

2016-08-04 Thread mhornbech
Hi

We are currently running a setup with Spark 1.6.2 inside Docker. It requires
the use of the HTTPBroadcastFactory instead of the default
TorrentBroadcastFactory to avoid the use of random ports, that cannot be
exposed through Docker. From the Spark 2.0 release notes I can see that the
HTTPBroadcast option has been removed. Are there any alternative means of
running a Spark 2.0 cluster in Docker?
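
One possible direction, sketched below with assumed placeholder values, is to pin the
otherwise random ports through standard Spark properties so they can be published
explicitly by Docker:

from pyspark import SparkConf

# minimal sketch; port numbers and hostname are placeholders, and every pinned
# port still has to be exposed/published on the container
conf = (SparkConf()
        .set("spark.driver.port", "7001")
        .set("spark.blockManager.port", "7003")
        .set("spark.driver.host", "driver.example.internal"))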

Morten





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-Spark-2-0-inside-Docker-tp27475.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark SQL and number of task

2016-08-04 Thread Takeshi Yamamuro
Seems the performance difference comes from `CassandraSourceRelation`.
I'm not familiar with the implementation though, I guess the filter `IN` is
pushed down
into the datasource and the other not.

You'd better off checking performance metrics in webUI.

// maropu

On Thu, Aug 4, 2016 at 8:41 PM, Marco Colombo 
wrote:

> Ok, thanx.
> The 2 plan are very similar
>
> with in condition
>
> == Physical Plan ==
> TungstenAggregate(key=[id#0L], functions=[(avg(avg#2),mode=Final,isDistinct=false)], output=[id#0L,_c1#81])
> +- TungstenExchange hashpartitioning(id#0L,10), None
>    +- TungstenAggregate(key=[id#0L], functions=[(avg(avg#2),mode=Partial,isDistinct=false)], output=[id#0L,sum#85,count#86L])
>       +- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@49243f65[id#0L,avg#2] PushedFilters: [In(id,[Ljava.lang.Object;@6f8b2e9d)]
>
> with the or condition
>
> == Physical Plan ==
> TungstenAggregate(key=[id#0L], functions=[(avg(avg#2),mode=Final,isDistinct=false)], output=[id#0L,_c1#88])
> +- TungstenExchange hashpartitioning(id#0L,10), None
>    +- TungstenAggregate(key=[id#0L], functions=[(avg(avg#2),mode=Partial,isDistinct=false)], output=[id#0L,sum#92,count#93L])
>       +- Filter ((id#0L = 94) || (id#0L = 2))
>          +- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@49243f65[id#0L,avg#2] PushedFilters: [Or(EqualTo(id,94),EqualTo(id,2))]
>
>
> Filters are pushed down, so I cannot realize why it is performing a so big
> data extraction in case of or. It's like a full table scan.
>
> Any advice?
>
> Thanks!
>
>
> 2016-08-04 13:25 GMT+02:00 Takeshi Yamamuro :
>
>> Hi,
>>
>> Please type `sqlCtx.sql("select *  ").explain` to show execution
>> plans.
>> Also, you can kill jobs from webUI.
>>
>> // maropu
>>
>>
>> On Thu, Aug 4, 2016 at 4:58 PM, Marco Colombo <
>> ing.marco.colo...@gmail.com> wrote:
>>
>>> Hi all, I've a question on how hive+spark are handling data.
>>>
>>> I've started a new HiveContext and I'm extracting data from cassandra.
>>> I've configured spark.sql.shuffle.partitions=10.
>>> Now, I've following query:
>>>
>>> select d.id, avg(d.avg) from v_points d where id=90 group by id;
>>>
>>> I see that 10 task are submitted and execution is fast. Every id on that
>>> table has 2000 samples.
>>>
>>> But if I just add a new id, as:
>>>
>>> select d.id, avg(d.avg) from v_points d where id=90 or id=2 group by id;
>>>
>>> it adds 663 task and query does not end.
>>>
>>> If I write query with in () like
>>>
>>> select d.id, avg(d.avg) from v_points d where id in (90,2) group by id;
>>>
>>> query is again fast.
>>>
>>> How can I get the 'execution plan' of the query?
>>>
>>> And also, how can I kill the long running submitted tasks?
>>>
>>> Thanks all!
>>>
>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>
>
> --
> Ing. Marco Colombo
>



-- 
---
Takeshi Yamamuro


How to avoid sql injection on SparkSQL?

2016-08-04 Thread Linyuxin
Hi All,
I want to know how to avoid SQL injection with Spark SQL.
Is there a common pattern for this?
e.g. a useful tool or code segment,

or should I just create such a "wheel" for Spark SQL myself?

Thanks.
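
One common pattern, sketched below with placeholder table and column names, is to keep
untrusted values out of the SQL text entirely and pass them through the DataFrame/Column
API instead:

from pyspark.sql.functions import col, lit

user_input = "90 OR 1=1"   # hypothetical untrusted value

# unsafe: the value is spliced into the SQL string and parsed as SQL
# sqlContext.sql("select * from some_table where id = " + user_input)

# safer: the value is passed as a literal, never parsed as SQL text
safe_df = sqlContext.table("some_table").filter(col("id") == lit(user_input))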


Re: Spark SQL and number of task

2016-08-04 Thread Marco Colombo
Ok, thanks.
The 2 plans are very similar

with in condition

== Physical Plan ==
TungstenAggregate(key=[id#0L], functions=[(avg(avg#2),mode=Final,isDistinct=false)], output=[id#0L,_c1#81])
+- TungstenExchange hashpartitioning(id#0L,10), None
   +- TungstenAggregate(key=[id#0L], functions=[(avg(avg#2),mode=Partial,isDistinct=false)], output=[id#0L,sum#85,count#86L])
      +- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@49243f65[id#0L,avg#2] PushedFilters: [In(id,[Ljava.lang.Object;@6f8b2e9d)]

with the or condition

== Physical Plan ==
TungstenAggregate(key=[id#0L], functions=[(avg(avg#2),mode=Final,isDistinct=false)], output=[id#0L,_c1#88])
+- TungstenExchange hashpartitioning(id#0L,10), None
   +- TungstenAggregate(key=[id#0L], functions=[(avg(avg#2),mode=Partial,isDistinct=false)], output=[id#0L,sum#92,count#93L])
      +- Filter ((id#0L = 94) || (id#0L = 2))
         +- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@49243f65[id#0L,avg#2] PushedFilters: [Or(EqualTo(id,94),EqualTo(id,2))]


Filters are pushed down, so I cannot understand why it is performing such a big
data extraction in the OR case. It's like a full table scan.

Any advice?

Thanks!


2016-08-04 13:25 GMT+02:00 Takeshi Yamamuro :

> Hi,
>
> Please type `sqlCtx.sql("select *  ").explain` to show execution plans.
> Also, you can kill jobs from webUI.
>
> // maropu
>
>
> On Thu, Aug 4, 2016 at 4:58 PM, Marco Colombo  > wrote:
>
>> Hi all, I've a question on how hive+spark are handling data.
>>
>> I've started a new HiveContext and I'm extracting data from cassandra.
>> I've configured spark.sql.shuffle.partitions=10.
>> Now, I've following query:
>>
>> select d.id, avg(d.avg) from v_points d where id=90 group by id;
>>
>> I see that 10 task are submitted and execution is fast. Every id on that
>> table has 2000 samples.
>>
>> But if I just add a new id, as:
>>
>> select d.id, avg(d.avg) from v_points d where id=90 or id=2 group by id;
>>
>> it adds 663 task and query does not end.
>>
>> If I write query with in () like
>>
>> select d.id, avg(d.avg) from v_points d where id in (90,2) group by id;
>>
>> query is again fast.
>>
>> How can I get the 'execution plan' of the query?
>>
>> And also, how can I kill the long running submitted tasks?
>>
>> Thanks all!
>>
>
>
>
> --
> ---
> Takeshi Yamamuro
>



-- 
Ing. Marco Colombo


Are join/groupBy operations with wide Java Beans using Dataset API much slower than using RDD API?

2016-08-04 Thread dueckm
Hello,

I built a prototype that uses join and groupBy operations via Spark RDD API.
Recently I migrated it to the Dataset API. Now it runs much slower than with
the original RDD implementation. 
Did I do something wrong here? Or is this a price I have to pay for the more
convenient API?
Is there a known solution for dealing with this effect (e.g. configuration via
"spark.sql.shuffle.partitions" - but then how could I determine the correct
value)?
In my prototype I use Java Beans with a lot of attributes. Does this slow
down Spark operations on Datasets?
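
As a small illustration of the configuration mentioned above; the value is only an
example, and a common heuristic is a small multiple of the total executor cores
available to the job:

# minimal sketch; 200 is a placeholder, not a recommendation
sqlContext.setConf("spark.sql.shuffle.partitions", "200")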

Here I have a simple example that shows the difference:
JoinGroupByTest.zip
- I build 2 RDDs and join and group them. Afterwards I count and display the
joined RDDs.  (Method de.testrddds.JoinGroupByTest.joinAndGroupViaRDD() )
- When I do the same actions with Datasets it takes approximately 40 times
as long (Methodd e.testrddds.JoinGroupByTest.joinAndGroupViaDatasets()).

Thank you very much for your help.
Matthias

PS1: excuse me for sending this post more than once, but I am new to this
mailing list and probably did something wrong when registering/subscribing,
so my previous postings have not been accepted ...

PS2: See the appended screenshots taken from Spark UI (jobs 0/1 belong to the
RDD implementation, jobs 2/3 to the Dataset implementation):



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Are-join-groupBy-operations-with-wide-Java-Beans-using-Dataset-API-much-slower-than-using-RDD-API-tp27473.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark SQL and number of task

2016-08-04 Thread Takeshi Yamamuro
Hi,

Please type `sqlCtx.sql("select *  ").explain` to show execution plans.
Also, you can kill jobs from webUI.

// maropu


On Thu, Aug 4, 2016 at 4:58 PM, Marco Colombo 
wrote:

> Hi all, I've a question on how hive+spark are handling data.
>
> I've started a new HiveContext and I'm extracting data from cassandra.
> I've configured spark.sql.shuffle.partitions=10.
> Now, I've following query:
>
> select d.id, avg(d.avg) from v_points d where id=90 group by id;
>
> I see that 10 task are submitted and execution is fast. Every id on that
> table has 2000 samples.
>
> But if I just add a new id, as:
>
> select d.id, avg(d.avg) from v_points d where id=90 or id=2 group by id;
>
> it adds 663 task and query does not end.
>
> If I write query with in () like
>
> select d.id, avg(d.avg) from v_points d where id in (90,2) group by id;
>
> query is again fast.
>
> How can I get the 'execution plan' of the query?
>
> And also, how can I kill the long running submitted tasks?
>
> Thanks all!
>



-- 
---
Takeshi Yamamuro


Re: SPARKSQL with HiveContext My job fails

2016-08-04 Thread Mich Talebzadeh
Well the error states


Exception in thread thread_name: java.lang.OutOfMemoryError: GC Overhead
limit exceeded

Cause: The detail message "GC overhead limit exceeded" indicates that the
garbage collector is running all the time and Java program is making very
slow progress. After a garbage collection, if the Java process is spending
more than approximately 98% of its time doing garbage collection and if it
is recovering less than 2% of the heap and has been doing so far the last 5
(compile time constant) consecutive garbage collections, then a
java.lang.OutOfMemoryError is thrown. This exception is typically thrown
because the amount of live data barely fits into the Java heap having
little free space for new allocations.
Action: Increase the heap size. The java.lang.OutOfMemoryError exception
for *GC Overhead limit exceeded* can be turned off with the command line
flag -XX:-UseGCOverheadLimit.
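
As an illustration of the "increase the heap size" action in Spark terms, a minimal
sketch with placeholder values; note that the driver's own heap has to be set before
its JVM starts (spark-defaults.conf or the spark-submit command line), not from inside
a running job:

from pyspark import SparkConf

# minimal sketch; sizes are placeholders
conf = (SparkConf()
        .set("spark.driver.memory", "8g")      # only effective if applied at submit time
        .set("spark.executor.memory", "8g"))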

We still don't know what the code is doing; you have not provided that
info. Are you running Spark on Yarn? Have you checked the Yarn logs?


HTH




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 4 August 2016 at 10:49, Vasu Devan  wrote:

> Hi Team,
>
> My Spark job fails with below error :
>
> Could you please advice me what is the problem with my job.
>
> Below is my error stack:
>
> 16/08/04 05:11:06 ERROR ActorSystemImpl: Uncaught fatal error from thread
> [sparkDriver-akka.actor.default-dispatcher-14] shutting down ActorSystem
> [sparkDriver]
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> at sun.reflect.ByteVectorImpl.trim(ByteVectorImpl.java:70)
> at
> sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:388)
> at
> sun.reflect.MethodAccessorGenerator.generateMethod(MethodAccessorGenerator.java:77)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:46)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at
> akka.serialization.JavaSerializer$$anonfun$1.apply(Serializer.scala:136)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at
> akka.serialization.JavaSerializer.fromBinary(Serializer.scala:136)
> at
> akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
> at scala.util.Try$.apply(Try.scala:161)
> 16/08/04 05:11:06 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
> down remote daemon.
> 16/08/04 05:11:07 INFO 

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Nick Pentreath
Hi Ben

Perhaps with this size cardinality it is worth looking at feature hashing
for your problem. Spark has the HashingTF transformer that works on a
column of "sentences" (i.e. [string]).

For categorical features you can hack it a little by converting your
feature value into a ["feature_name=feature_value"] representation. Then
HashingTF can be used as is. Note you can also just do ["feature_value"],
but the former would allow you, with a bit of munging, to hash all your
feature columns at the same time.

The advantage is speed and bounded memory footprint. The disadvantages
include (i) no way to reverse the mapping from feature_index ->
feature_name; (ii) potential for hash collisions (can be helped a bit by
increasing your feature vector size).

Here is a minimal example:

In [1]: from pyspark.ml.feature import StringIndexer, OneHotEncoder,
HashingTF
In [2]: from pyspark.sql.types import StringType, ArrayType
In [3]: from pyspark.sql.functions import udf

In [4]: df = spark.createDataFrame([(0, "foo"), (1, "bar"), (2, "foo"), (3,
"baz")], ["id", "feature"])

In [5]: to_array = udf(lambda s: ["feature=%s" % s],
ArrayType(StringType()))

In [6]: df = df.withColumn("features", to_array("feature"))

In [7]: df.show()
+---+-------+-------------+
| id|feature|     features|
+---+-------+-------------+
|  0|    foo|[feature=foo]|
|  1|    bar|[feature=bar]|
|  2|    foo|[feature=foo]|
|  3|    baz|[feature=baz]|
+---+-------+-------------+

In [8]: indexer = StringIndexer(inputCol="feature",
outputCol="feature_index")

In [9]: indexed = indexer.fit(df).transform(df)

In [10]: encoder = OneHotEncoder(dropLast=False, inputCol="feature_index",
outputCol="feature_vector")

In [11]: encoded = encoder.transform(indexed)

In [12]: encoded.show()
+---+-------+-------------+-------------+--------------+
| id|feature|     features|feature_index|feature_vector|
+---+-------+-------------+-------------+--------------+
|  0|    foo|[feature=foo]|          0.0| (3,[0],[1.0])|
|  1|    bar|[feature=bar]|          2.0| (3,[2],[1.0])|
|  2|    foo|[feature=foo]|          0.0| (3,[0],[1.0])|
|  3|    baz|[feature=baz]|          1.0| (3,[1],[1.0])|
+---+-------+-------------+-------------+--------------+

In [22]: hasher = HashingTF(numFeatures=2**8, inputCol="features",
outputCol="features_vector")

In [23]: hashed = hasher.transform(df)

In [24]: hashed.show()
+---+-------+-------------+-----------------+
| id|feature|     features|  features_vector|
+---+-------+-------------+-----------------+
|  0|    foo|[feature=foo]| (256,[59],[1.0])|
|  1|    bar|[feature=bar]|(256,[219],[1.0])|
|  2|    foo|[feature=foo]| (256,[59],[1.0])|
|  3|    baz|[feature=baz]| (256,[38],[1.0])|
+---+-------+-------------+-----------------+
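
As a sketch of the "bit of munging" mentioned above (the second column and its
values are made up), several categorical columns can be hashed in one pass by
building a single ["col=value", ...] array per row:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
from pyspark.ml.feature import HashingTF

cat_cols = ["feature", "other_feature"]   # second column is made up for the sketch

df_multi = spark.createDataFrame(
    [(0, "foo", "red"), (1, "bar", "blue"), (2, "foo", "red")],
    ["id"] + cat_cols)

# one "col=value" token per column, so every column shares the same hashed space
to_tokens = udf(lambda *vals: ["%s=%s" % (c, v) for c, v in zip(cat_cols, vals)],
                ArrayType(StringType()))

df_multi = df_multi.withColumn("all_features", to_tokens(*cat_cols))

hasher = HashingTF(numFeatures=2**20, inputCol="all_features",
                   outputCol="features_vector")
hasher.transform(df_multi).show(truncate=False)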

On Thu, 4 Aug 2016 at 10:07 Ben Teeuwen  wrote:

> I raised driver memory to 30G and maxresultsize to 25G, this time in
> pyspark.
>
> *Code run:*
>
> cat_int  = ['bigfeature']
>
> stagesIndex = []
> stagesOhe   = []
> for c in cat_int:
>   stagesIndex.append(StringIndexer(inputCol=c,
> outputCol="{}Index".format(c)))
>   stagesOhe.append(OneHotEncoder(dropLast= False, inputCol =
> "{}Index".format(c), outputCol = "{}OHE".format(c)))
>
> df2 = df
>
> for i in range(len(stagesIndex)):
>   logging.info("Starting with {}".format(cat_int[i]))
>   stagesIndex[i].fit(df2)
>   logging.info("Fitted. Now transforming:")
>   df2 = stagesIndex[i].fit(df2).transform(df2)
>   logging.info("Transformed. Now showing transformed:")
>   df2.show()
>   logging.info("OHE")
>   df2 = stagesOhe[i].transform(df2)
>   logging.info("Fitted. Now showing OHE:")
>   df2.show()
>
> *Now I get error:*
>
> 2016-08-04 08:53:44,839 INFO   Starting with bigfeature
> [57/7074]
> ukStringIndexer_442b8e11e3294de9b83a
> 2016-08-04 09:06:18,147 INFO   Fitted. Now transforming:
> 16/08/04 09:10:35 WARN BlockManagerMaster: Failed to remove shuffle 3 -
> Cannot receive any reply in 120 seconds. This timeout is controlled by
> spark.rpc.askTimeout
> org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120
> seconds. This timeout is controlled by spark.rpc.askTimeout
> at org.apache.spark.rpc.RpcTimeout.org
> $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> at
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> at
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
> at scala.util.Try$.apply(Try.scala:192)
> at scala.util.Failure.recover(Try.scala:216)
> at
> scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
> at
> scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
> at 

SPARKSQL with HiveContext My job fails

2016-08-04 Thread Vasu Devan
Hi Team,

My Spark job fails with the error below:

Could you please advise what the problem with my job is.

Below is my error stack:

16/08/04 05:11:06 ERROR ActorSystemImpl: Uncaught fatal error from thread
[sparkDriver-akka.actor.default-dispatcher-14] shutting down ActorSystem
[sparkDriver]
java.lang.OutOfMemoryError: GC overhead limit exceeded
at sun.reflect.ByteVectorImpl.trim(ByteVectorImpl.java:70)
at
sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:388)
at
sun.reflect.MethodAccessorGenerator.generateMethod(MethodAccessorGenerator.java:77)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:46)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
akka.serialization.JavaSerializer$$anonfun$1.apply(Serializer.scala:136)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at
akka.serialization.JavaSerializer.fromBinary(Serializer.scala:136)
at
akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
at scala.util.Try$.apply(Try.scala:161)
16/08/04 05:11:06 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
down remote daemon.
16/08/04 05:11:07 INFO RemoteActorRefProvider$RemotingTerminator: Remote
daemon shut down; proceeding with flushing remote transports.
16/08/04 05:11:07 INFO TaskSetManager: Finished task 18540.0 in stage 148.0
(TID 153058) in 190291 ms on lhrrhegapq005.enterprisenet.org (18536/32768)
16/08/04 05:11:07 INFO TaskSetManager: Finished task 18529.0 in stage 148.0
(TID 153044) in 190300 ms on lhrrhegapq008.enterprisenet.org (18537/32768)
16/08/04 05:11:07 INFO TaskSetManager: Finished task 18530.0 in stage 148.0
(TID 153049) in 190297 ms on lhrrhegapq005.enterprisenet.org (18538/32768)
16/08/04 05:11:07 INFO TaskSetManager: Finished task 18541.0 in stage 148.0
(TID 153062) in 190291 ms on lhrrhegapq006.enterprisenet.org (18539/32768)
16/08/04 05:11:09 INFO TaskSetManager: Finished task 18537.0 in stage 148.0
(TID 153057) in 191648 ms on lhrrhegapq003.enterprisenet.org (18540/32768)
16/08/04 05:11:10 INFO TaskSetManager: Finished task 18557.0 in stage 148.0
(TID 153073) in 193193 ms on lhrrhegapq003.enterprisenet.org (18541/32768)
16/08/04 05:11:10 INFO TaskSetManager: Finished task 18528.0 in stage 148.0
(TID 153045) in 193206 ms on lhrrhegapq007.enterprisenet.org (18542/32768)
16/08/04 05:11:10 INFO TaskSetManager: Finished task 18555.0 in stage 148.0
(TID 153072) in 193195 ms on lhrrhegapq002.enterprisenet.org (18543/32768)
16/08/04 05:11:10 ERROR YarnClientSchedulerBackend: Yarn application has
already exited with state FINISHED!
16/08/04 05:11:13 WARN QueuedThreadPool: 9 threads could not be stopped
16/08/04 05:11:13 INFO SparkUI: Stopped Spark web UI at
http://10.90.50.64:4043
16/08/04 05:11:15 INFO DAGScheduler: Stopping DAGScheduler
16/08/04 05:11:16 INFO DAGScheduler: Job 94 failed: save at
ndx_scala_util.scala:1264, took 232.788303 s
16/08/04 05:11:16 ERROR InsertIntoHadoopFsRelation: Aborting job.
org.apache.spark.SparkException: Job cancelled because SparkContext was
shut down
at

pycharm and pyspark on windows

2016-08-04 Thread pseudo oduesp
Hi,
What is a good configuration for PySpark and PyCharm on Windows?

Thanks


Questions about ml.random forest (only one decision tree?)

2016-08-04 Thread 陈哲
Hi all
 I'm trying to use spark ml to do some prediction with random forest.
By reading the example code
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaRandomForestClassifierExample.java
,
I can only see that it is similar to
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java.
Isn't the random forest algorithm supposed to use multiple decision trees?
I'm new to Spark and ML. Can anyone help me, maybe by providing an
example of using multiple decision trees in a random forest in Spark?

Thanks
Best Regards
Patrick
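
A minimal PySpark sketch (not taken from the linked examples; the column names
and training DataFrame are placeholders) showing that the ML random forest is
indeed an ensemble whose size is set by numTrees:

from pyspark.ml.classification import RandomForestClassifier

# numTrees sets how many decision trees make up the forest (the ml default is 20)
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=20)
model = rf.fit(trainingData)   # trainingData: a hypothetical DataFrame of (label, features)
print(model)                   # the fitted model is an ensemble of 20 trees,
                               # not a single decision tree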


Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Ben Teeuwen
I raised driver memory to 30G and maxresultsize to 25G, this time in pyspark. 

Code run:

cat_int  = ['bigfeature']

stagesIndex = []
stagesOhe   = []
for c in cat_int:
  stagesIndex.append(StringIndexer(inputCol=c, outputCol="{}Index".format(c)))
  stagesOhe.append(OneHotEncoder(dropLast= False, inputCol = 
"{}Index".format(c), outputCol = "{}OHE".format(c)))

df2 = df

for i in range(len(stagesIndex)):
  logging.info("Starting with {}".format(cat_int[i]))
  stagesIndex[i].fit(df2)
  logging.info("Fitted. Now transforming:")
  df2 = stagesIndex[i].fit(df2).transform(df2)
  logging.info("Transformed. Now showing transformed:")
  df2.show()
  logging.info("OHE")
  df2 = stagesOhe[i].transform(df2)
  logging.info("Fitted. Now showing OHE:")
  df2.show()

Now I get error:

2016-08-04 08:53:44,839 INFO   Starting with bigfeature   
[57/7074]
ukStringIndexer_442b8e11e3294de9b83a
2016-08-04 09:06:18,147 INFO   Fitted. Now transforming:
16/08/04 09:10:35 WARN BlockManagerMaster: Failed to remove shuffle 3 - Cannot 
receive any reply in 120 seconds. This timeout is controlled by 
spark.rpc.askTimeout
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 
seconds. This timeout is controlled by spark.rpc.askTimeout
at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
at scala.util.Try$.apply(Try.scala:192)
at scala.util.Failure.recover(Try.scala:216)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at 
org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
at 
scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.complete(Promise.scala:55)
at 
scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at 
scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
at 
scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
at 
scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
at 
scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
at 
scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at 
scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
at 
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at 
scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
at 
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
at 
org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:205)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:239)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 

 [13/7074]
at 

Spark SQL and number of task

2016-08-04 Thread Marco Colombo
Hi all, I have a question on how Hive + Spark handle data.

I've started a new HiveContext and I'm extracting data from cassandra.
I've configured spark.sql.shuffle.partitions=10.
Now, I've following query:

select d.id, avg(d.avg) from v_points d where id=90 group by id;

I see that 10 tasks are submitted and execution is fast. Every id in that
table has 2000 samples.

But if I just add a new id, as:

select d.id, avg(d.avg) from v_points d where id=90 or id=2 group by id;

it adds 663 tasks and the query does not end.

If I write the query with IN (...) like

select d.id, avg(d.avg) from v_points d where id in (90,2) group by id;

the query is again fast.

How can I get the 'execution plan' of the query?

And also, how can I kill the long running submitted tasks?

Thanks all!
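
A small sketch of both points (assuming a HiveContext bound to sqlContext and
the SparkContext bound to sc, as in a default shell; not from the original
thread):

# print the logical and physical plan of the query
sqlContext.sql(
    "select d.id, avg(d.avg) from v_points d where id=90 or id=2 group by id"
).explain(True)

# or ask for the plan in SQL directly
sqlContext.sql(
    "EXPLAIN EXTENDED select d.id, avg(d.avg) "
    "from v_points d where id=90 or id=2 group by id"
).show(truncate=False)

# tag work with a job group so long-running jobs can be cancelled later
sc.setJobGroup("adhoc", "exploratory aggregation")
# ... run the query ...
sc.cancelJobGroup("adhoc")   # or sc.cancelAllJobs()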


Re: how to debug spark app?

2016-08-04 Thread Ben Teeuwen
Related question: what are good profiling tools other than following along in the
application master while the code runs?
Are there things that can be logged during the run? If I have, say, 2 ways of
accomplishing the same thing, and I want to learn about the time/memory/general
resource-blocking performance of both, what is the best way of doing that?
Something like what tic/toc does in Matlab, or 'profile on' / 'profile report'.
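
A rough tic/toc analogue in PySpark (just a sketch; the two variant functions
are placeholders). Because of lazy evaluation, an action has to be forced
inside the timed block, otherwise only plan construction is measured:

import time
from contextlib import contextmanager

@contextmanager
def tictoc(label):
    start = time.time()
    yield
    print("%s took %.1f s" % (label, time.time() - start))

with tictoc("variant A"):
    result_a = transform_one_way(df)    # placeholder for the first approach
    result_a.count()                    # force an action so the timing is meaningful

with tictoc("variant B"):
    result_b = transform_other_way(df)  # placeholder for the second approach
    result_b.count()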

> On Aug 4, 2016, at 3:19 AM, Sumit Khanna  wrote:
> 
> Am not really sure of the best practices on this , but I either consult the 
> localhost:4040/jobs/ etc 
> or better this :
> 
> val customSparkListener: CustomSparkListener = new CustomSparkListener()
> sc.addSparkListener(customSparkListener)
> class CustomSparkListener extends SparkListener {
>  override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd) {
>   debug(s"application ended at time : ${applicationEnd.time}")
>  }
>  override def onApplicationStart(applicationStart: 
> SparkListenerApplicationStart): Unit ={
>   debug(s"[SPARK LISTENER DEBUGS] application Start app attempt id : 
> ${applicationStart.appAttemptId}")
>   debug(s"[SPARK LISTENER DEBUGS] application Start app id : 
> ${applicationStart.appId}")
>   debug(s"[SPARK LISTENER DEBUGS] application start app name : 
> ${applicationStart.appName}")
>   debug(s"[SPARK LISTENER DEBUGS] application start driver logs : 
> ${applicationStart.driverLogs}")
>   debug(s"[SPARK LISTENER DEBUGS] application start spark user : 
> ${applicationStart.sparkUser}")
>   debug(s"[SPARK LISTENER DEBUGS] application start time : 
> ${applicationStart.time}")
>  }
>  override def onExecutorAdded(executorAdded: SparkListenerExecutorAdded): 
> Unit = {
>   debug(s"[SPARK LISTENER DEBUGS] ${executorAdded.executorId}")
>   debug(s"[SPARK LISTENER DEBUGS] ${executorAdded.executorInfo}")
>   debug(s"[SPARK LISTENER DEBUGS] ${executorAdded.time}")
>  }
>  override  def onExecutorRemoved(executorRemoved: 
> SparkListenerExecutorRemoved): Unit = {
>   debug(s"[SPARK LISTENER DEBUGS] the executor removed Id : 
> ${executorRemoved.executorId}")
>   debug(s"[SPARK LISTENER DEBUGS] the executor removed reason : 
> ${executorRemoved.reason}")
>   debug(s"[SPARK LISTENER DEBUGS] the executor removed at time : 
> ${executorRemoved.time}")
>  }
> 
>  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
>   debug(s"[SPARK LISTENER DEBUGS] job End id : ${jobEnd.jobId}")
>   debug(s"[SPARK LISTENER DEBUGS] job End job Result : ${jobEnd.jobResult}")
>   debug(s"[SPARK LISTENER DEBUGS] job End time : ${jobEnd.time}")
>  }
>  override def onJobStart(jobStart: SparkListenerJobStart) {
>   debug(s"[SPARK LISTENER DEBUGS] Job started with properties 
> ${jobStart.properties}")
>   debug(s"[SPARK LISTENER DEBUGS] Job started with time ${jobStart.time}")
>   debug(s"[SPARK LISTENER DEBUGS] Job started with job id 
> ${jobStart.jobId.toString}")
>   debug(s"[SPARK LISTENER DEBUGS] Job started with stage ids 
> ${jobStart.stageIds.toString()}")
>   debug(s"[SPARK LISTENER DEBUGS] Job started with stages 
> ${jobStart.stageInfos.size} : $jobStart")
>  }
> 
>  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): 
> Unit = {
>   debug(s"[SPARK LISTENER DEBUGS] Stage ${stageCompleted.stageInfo.stageId} 
> completed with ${stageCompleted.stageInfo.numTasks} tasks.")
>   debug(s"[SPARK LISTENER DEBUGS] Stage details : 
> ${stageCompleted.stageInfo.details.toString}")
>   debug(s"[SPARK LISTENER DEBUGS] Stage completion time : 
> ${stageCompleted.stageInfo.completionTime}")
>   debug(s"[SPARK LISTENER DEBUGS] Stage details : 
> ${stageCompleted.stageInfo.rddInfos.toString()}")
>  }
>  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): 
> Unit = {
>   debug(s"[SPARK LISTENER DEBUGS] Stage properties : 
> ${stageSubmitted.properties}")
>   debug(s"[SPARK LISTENER DEBUGS] Stage rddInfos : 
> ${stageSubmitted.stageInfo.rddInfos.toString()}")
>   debug(s"[SPARK LISTENER DEBUGS] Stage submission Time : 
> ${stageSubmitted.stageInfo.submissionTime}")
>   debug(s"[SPARK LISTENER DEBUGS] Stage submission details : 
> ${stageSubmitted.stageInfo.details.toString()}")
>  }
>  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>   debug(s"[SPARK LISTENER DEBUGS] task ended reason : ${taskEnd.reason}")
>   debug(s"[SPARK LISTENER DEBUGS] task type : ${taskEnd.taskType}")
>   debug(s"[SPARK LISTENER DEBUGS] task Metrics : ${taskEnd.taskMetrics}")
>   debug(s"[SPARK LISTENER DEBUGS] task Info : ${taskEnd.taskInfo}")
>   debug(s"[SPARK LISTENER DEBUGS] task stage Id : ${taskEnd.stageId}")
>   debug(s"[SPARK LISTENER DEBUGS] task stage attempt Id : 
> ${taskEnd.stageAttemptId}")
>   debug(s"[SPARK LISTENER DEBUGS] task ended reason : ${taskEnd.reason}")
>  }
>  override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
>   debug(s"[SPARK LISTENER DEBUGS] stage Attempt id : 
> ${taskStart.stageAttemptId}")
>   debug(s"[SPARK 

Re: Stop Spark Streaming Jobs

2016-08-04 Thread Sandeep Nemuri
Also set spark.streaming.stopGracefullyOnShutdown to true
If true, Spark shuts down the StreamingContext gracefully on JVM shutdown
rather than immediately.

http://spark.apache.org/docs/latest/configuration.html#spark-streaming
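
A minimal PySpark sketch of wiring both together (the conf has to be set before
the contexts are created; the app name and batch interval are illustrative):

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = (SparkConf()
        .setAppName("my-streaming-app")
        .set("spark.streaming.stopGracefullyOnShutdown", "true"))
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, batchDuration=60)

# ... define the Kafka DStream and processing here ...

ssc.start()
ssc.awaitTermination()

# or stop explicitly, letting in-flight batches finish first:
# ssc.stop(stopSparkContext=True, stopGraceFully=True)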


On Thu, Aug 4, 2016 at 12:31 PM, Sandeep Nemuri 
wrote:

> StreamingContext.stop(...) if using scala
> JavaStreamingContext.stop(...) if using Java
>
>
> On Wed, Aug 3, 2016 at 9:14 PM, Tony Lane  wrote:
>
>> SparkSession exposes stop() method
>>
>> On Wed, Aug 3, 2016 at 8:53 AM, Pradeep  wrote:
>>
>>> Thanks Park. I am doing the same. Was trying to understand if there are
>>> other ways.
>>>
>>> Thanks,
>>> Pradeep
>>>
>>> > On Aug 2, 2016, at 10:25 PM, Park Kyeong Hee 
>>> wrote:
>>> >
>>> > So sorry. Your name was Pradeep !!
>>> >
>>> > -Original Message-
>>> > From: Park Kyeong Hee [mailto:kh1979.p...@samsung.com]
>>> > Sent: Wednesday, August 03, 2016 11:24 AM
>>> > To: 'Pradeep'; 'user@spark.apache.org'
>>> > Subject: RE: Stop Spark Streaming Jobs
>>> >
>>> > Hi. Paradeep
>>> >
>>> >
>>> > Did you mean, how to kill the job?
>>> > If yes, you should kill the driver and follow next.
>>> >
>>> > on yarn-client
>>> > 1. find pid - "ps -es | grep "
>>> > 2. kill it - "kill -9 "
>>> > 3. check executors were down - "yarn application -list"
>>> >
>>> > on yarn-cluster
>>> > 1. find driver's application ID - "yarn application -list"
>>> > 2. stop it - "yarn application -kill "
>>> > 3. check driver and executors were down - "yarn application -list"
>>> >
>>> >
>>> > Thanks.
>>> >
>>> > -Original Message-
>>> > From: Pradeep [mailto:pradeep.mi...@mail.com]
>>> > Sent: Wednesday, August 03, 2016 10:48 AM
>>> > To: user@spark.apache.org
>>> > Subject: Stop Spark Streaming Jobs
>>> >
>>> > Hi All,
>>> >
>>> > My streaming job reads data from Kafka. The job is triggered and
>>> pushed to
>>> > background with nohup.
>>> >
>>> > What are the recommended ways to stop job either on yarn-client or
>>> cluster
>>> > mode.
>>> >
>>> > Thanks,
>>> > Pradeep
>>> >
>>> > -
>>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>> >
>>> >
>>> >
>>> >
>>> > -
>>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>> >
>>>
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>
>
> --
> *  Regards*
> *  Sandeep Nemuri*
>



-- 
*  Regards*
*  Sandeep Nemuri*


Re: Stop Spark Streaming Jobs

2016-08-04 Thread Sandeep Nemuri
StreamingContext.stop(...) if using scala
JavaStreamingContext.stop(...) if using Java


On Wed, Aug 3, 2016 at 9:14 PM, Tony Lane  wrote:

> SparkSession exposes stop() method
>
> On Wed, Aug 3, 2016 at 8:53 AM, Pradeep  wrote:
>
>> Thanks Park. I am doing the same. Was trying to understand if there are
>> other ways.
>>
>> Thanks,
>> Pradeep
>>
>> > On Aug 2, 2016, at 10:25 PM, Park Kyeong Hee 
>> wrote:
>> >
>> > So sorry. Your name was Pradeep !!
>> >
>> > -Original Message-
>> > From: Park Kyeong Hee [mailto:kh1979.p...@samsung.com]
>> > Sent: Wednesday, August 03, 2016 11:24 AM
>> > To: 'Pradeep'; 'user@spark.apache.org'
>> > Subject: RE: Stop Spark Streaming Jobs
>> >
>> > Hi. Paradeep
>> >
>> >
>> > Did you mean, how to kill the job?
>> > If yes, you should kill the driver and follow next.
>> >
>> > on yarn-client
>> > 1. find pid - "ps -es | grep "
>> > 2. kill it - "kill -9 "
>> > 3. check executors were down - "yarn application -list"
>> >
>> > on yarn-cluster
>> > 1. find driver's application ID - "yarn application -list"
>> > 2. stop it - "yarn application -kill "
>> > 3. check driver and executors were down - "yarn application -list"
>> >
>> >
>> > Thanks.
>> >
>> > -Original Message-
>> > From: Pradeep [mailto:pradeep.mi...@mail.com]
>> > Sent: Wednesday, August 03, 2016 10:48 AM
>> > To: user@spark.apache.org
>> > Subject: Stop Spark Streaming Jobs
>> >
>> > Hi All,
>> >
>> > My streaming job reads data from Kafka. The job is triggered and pushed
>> to
>> > background with nohup.
>> >
>> > What are the recommended ways to stop job either on yarn-client or
>> cluster
>> > mode.
>> >
>> > Thanks,
>> > Pradeep
>> >
>> > -
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >
>> >
>> >
>> >
>> > -
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


-- 
*  Regards*
*  Sandeep Nemuri*


Explanation regarding Spark Streaming

2016-08-04 Thread Saurav Sinha
Hi,

I have a query.

Q1. What will happen if a Spark Streaming job has a batch duration of 60
sec and the processing time of the complete pipeline is greater than 60 sec?

-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


How to connect Power BI to Apache Spark on local machine?

2016-08-04 Thread Devi P.V
Hi all,
I am a newbie in Power BI. What configuration is needed to connect Power
BI to Spark on my local machine? I found some documents that mention
Spark over Azure's HDInsight, but I didn't find any reference material for
connecting to Spark on a remote machine. Is it possible?

The following is the previously mentioned link that describes the steps for
connecting to Spark over Azure's HDInsight:

https://powerbi.microsoft.com/en-us/documentation/powerbi-spark-on-hdinsight-with-direct-connect/

Thanks
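
One common route (a sketch with default port and paths, not an official Power
BI recipe) is to expose Spark through its Thrift JDBC/ODBC server and point the
Spark ODBC driver used by Power BI at that endpoint:

# start the Thrift (JDBC/ODBC) server on the local machine
$SPARK_HOME/sbin/start-thriftserver.sh --master local[*] \
  --hiveconf hive.server2.thrift.port=10000

# sanity-check the endpoint with beeline before configuring Power BI
$SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10000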


Re: how to run local[k] threads on a single core

2016-08-04 Thread Sun Rui
I don't think it is possible, as Spark does not support thread-to-CPU affinity.
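
One OS-level workaround (not a Spark feature; assumes Linux, and the script
name is a placeholder) is to pin the whole local-mode JVM to a single core with
taskset, so the local[5] task threads time-share that one core:

taskset -c 0 spark-submit --master "local[5]" my_app.py
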
> On Aug 4, 2016, at 14:27, sujeet jog  wrote:
> 
> Is there a way we can run multiple tasks concurrently on a single core in 
> local mode.
> 
> For example: I have 5 partitions ~ 5 tasks, and only a single core. I want these 
> tasks to run concurrently, and to specify that they run on a single core. 
> 
> The machine itself has, say, 4 cores, but I want to utilize only 1 core out of 
> it. 
> 
> Is it possible ?
> 
> Thanks, 
> Sujeet
> 



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



how to run local[k] threads on a single core

2016-08-04 Thread sujeet jog
Is there a way we can run multiple tasks concurrently on a single core in
local mode.

For example: I have 5 partitions ~ 5 tasks, and only a single core. I want
these tasks to run concurrently, and to specify that they run on a single
core.

The machine itself has, say, 4 cores, but I want to utilize only 1 core out of
it.

Is it possible ?

Thanks,
Sujeet


Spark 2.0 - make-distribution fails while regular build succeeded

2016-08-04 Thread Richard Siebeling
Hi,

Spark 2.0 with the MapR Hadoop libraries was successfully built using the
following command:
./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0-mapr-1602
-DskipTests clean package

However, when I then try to build a runnable distribution using the
following command:
./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.7
-Dhadoop.version=2.7.0-mapr-1602

It fails with the error "bootstrap class path not set in conjunction with
-source 1.7"
Could you please help? I do not know what this error means.

thanks in advance,
Richard