Aggregation of Streaming UI Statistics for multiple jobs

2018-05-26 Thread skmishra
Hi,

I am working on a streaming use case where I need to run multiple Spark
Streaming applications at the same time and measure their throughput and
latencies. The Spark UI provides all the statistics, but if I want to run
more than 100 applications at the same time, I have no good way to
aggregate them. Opening 100 browser windows and collecting all the data by
hand is not practical. If you could provide any pointers on how to collect
these statistics from code, I can write a script to run my experiment. Any
help is greatly appreciated. Thanks in advance.
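
For example, something along the lines of the sketch below is what I have in
mind, assuming the monitoring REST API exposes what I need (driver host names
and UI ports are placeholders; each application serves its own UI port):

import requests

# Placeholder driver UI endpoints, one per running streaming application
# (each application serves its own UI, typically on 4040, 4041, ...).
DRIVER_UIS = ["http://driver-host-1:4040", "http://driver-host-2:4041"]

def streaming_stats(ui):
    # The monitoring REST API lists the applications behind this UI and
    # exposes per-application streaming statistics (batch duration, average
    # input rate, scheduling delay, processing time, ...).
    for app in requests.get(ui + "/api/v1/applications").json():
        stats = requests.get(
            ui + "/api/v1/applications/" + app["id"] + "/streaming/statistics"
        ).json()
        yield app["id"], stats

for ui in DRIVER_UIS:
    for app_id, stats in streaming_stats(ui):
        print(app_id, stats.get("avgInputRate"), stats.get("avgProcessingTime"))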

Regards,
Sitakanta Mishra 






Re: PySpark API on top of Apache Arrow

2018-05-26 Thread Jules Damji
Actually, we do mention that Pandas UDFs are built upon Apache Arrow :-) and
point to the blog post by their contributors from Two Sigma. :-)

β€œOn the other hand, Pandas UDF built atop Apache Arrow accords high-performance 
to Python developers, whether you use Pandas UDFs on a single-node machine or 
distributed cluster.”
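
For anyone landing on this thread later, a minimal scalar Pandas UDF in Spark
2.3 looks roughly like the sketch below (function and column names are purely
illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()

# Scalar Pandas UDF: receives and returns a pandas.Series, with the data
# exchanged between the JVM and the Python worker as Arrow record batches.
@pandas_udf("long", PandasUDFType.SCALAR)
def plus_one(v):
    return v + 1

spark.range(0, 1000).select(plus_one(col("id")).alias("id_plus_one")).show(3)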

Cheers
Jules 

Sent from my iPhone
Pardon the dumb thumb typos :)

> On May 26, 2018, at 12:41 PM, Corey Nolet  wrote:
> 
> Gourav & Nicholas,
> 
> Thank you! It does look like the PySpark Pandas UDF is exactly what I want,
> and the article I read didn't mention that it used Arrow underneath. Looks
> like Wes McKinney was also a key part of building the Pandas UDF.
> 
> Gourav,
> 
> I totally apologize for my long and drawn-out response to you. I initially
> misunderstood your response. I also need to take the time to dive into the
> PySpark source code; I was assuming that it was just firing up JVMs under the
> hood.
> 
> Thanks again! I'll report back with findings. 
> 
>> On Sat, May 26, 2018 at 2:51 PM, Nicolas Paris  wrote:
>> Hi Corey,
>>
>> I'm not familiar with Arrow or Plasma. However, I recently read an article
>> about Spark on a standalone machine (your case). Sounds like you could
>> benefit from PySpark "as-is":
>> 
>> https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html
>> 
>> Regards,
>> 
>> 2018-05-23 22:30 GMT+02:00 Corey Nolet :
>>> Please forgive me if this question has been asked already. 
>>> 
>>> I'm working in Python with Arrow+Plasma+Pandas DataFrames. I'm curious if
>>> anyone knows of any efforts to implement the PySpark API on top of Apache
>>> Arrow directly. In my case, I'm doing data science on a machine with 288
>>> cores and 1 TB of RAM.
>>> 
>>> It would make life much easier if I was able to use the flexibility of the 
>>> PySpark API (rather than having to be tied to the operations in Pandas). It 
>>> seems like an implementation would be fairly straightforward using the 
>>> Plasma server and object_ids. 
>>> 
>>> If you have not heard of such an effort underway, are there any reasons
>>> why it would be a bad idea?
>>> 
>>> 
>>> Thanks!
>> 
> 


Re: PySpark API on top of Apache Arrow

2018-05-26 Thread Corey Nolet
Gourav & Nicholas,

Thank you! It does look like the PySpark Pandas UDF is exactly what I want,
and the article I read didn't mention that it used Arrow underneath. Looks
like Wes McKinney was also a key part of building the Pandas UDF.

Gourav,

I totally apologize for my long and drawn-out response to you. I initially
misunderstood your response. I also need to take the time to dive into the
PySpark source code; I was assuming that it was just firing up JVMs under
the hood.

Thanks again! I'll report back with findings.
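
Related, a minimal sketch of the Arrow-backed conversion path that ships in
2.3 (the flag is off by default; this assumes nothing beyond an active
SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark 2.3: enable Arrow-based columnar data transfer for
# toPandas() and createDataFrame(pandas_df).
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pdf = spark.range(0, 10000).toPandas()   # JVM -> Python via Arrow record batches
sdf = spark.createDataFrame(pdf)         # and back again
print(len(pdf), sdf.count())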

On Sat, May 26, 2018 at 2:51 PM, Nicolas Paris  wrote:

> Hi Corey,
>
> I'm not familiar with Arrow or Plasma. However, I recently read an article
> about Spark on a standalone machine (your case). Sounds like you could
> benefit from PySpark "as-is":
>
> https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html
>
> Regards,
>
> 2018-05-23 22:30 GMT+02:00 Corey Nolet :
>
>> Please forgive me if this question has been asked already.
>>
>> I'm working in Python with Arrow+Plasma+Pandas DataFrames. I'm curious if
>> anyone knows of any efforts to implement the PySpark API on top of Apache
>> Arrow directly. In my case, I'm doing data science on a machine with 288
>> cores and 1 TB of RAM.
>>
>> It would make life much easier if I was able to use the flexibility of
>> the PySpark API (rather than having to be tied to the operations in
>> Pandas). It seems like an implementation would be fairly straightforward
>> using the Plasma server and object_ids.
>>
>> If you have not heard of such an effort underway, are there any
>> reasons why it would be a bad idea?
>>
>>
>> Thanks!
>>
>
>


Re: PySpark API on top of Apache Arrow

2018-05-26 Thread Nicolas Paris
Hi Corey,

I'm not familiar with Arrow or Plasma. However, I recently read an article
about Spark on a standalone machine (your case). Sounds like you could
benefit from PySpark "as-is":

https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html

Regards,

2018-05-23 22:30 GMT+02:00 Corey Nolet :

> Please forgive me if this question has been asked already.
>
> I'm working in Python with Arrow+Plasma+Pandas DataFrames. I'm curious if
> anyone knows of any efforts to implement the PySpark API on top of Apache
> Arrow directly. In my case, I'm doing data science on a machine with 288
> cores and 1 TB of RAM.
>
> It would make life much easier if I was able to use the flexibility of the
> PySpark API (rather than having to be tied to the operations in Pandas). It
> seems like an implementation would be fairly straightforward using the
> Plasma server and object_ids.
>
> If you have not heard of such an effort underway, are there any
> reasons why it would be a bad idea?
>
>
> Thanks!
>


Re: Spark 2.3 Tree Error

2018-05-26 Thread Aakash Basu
I think I found the solution.

The last comment from this link -
https://issues.apache.org/jira/browse/SPARK-14948

But my question is: even after using table.column, why does Spark find the
same column name from two different tables ambiguous?

I mean, with table1.column = table2.column, Spark should understand that even
though the column names are the same, they come from two different tables,
shouldn't it?

Well, I'll try out the solution provided above, and see if it works for me.
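
For anyone else hitting this, a rough sketch of one commonly suggested
workaround, i.e. rebuilding one side of the self-join so it gets fresh
attribute ids (the stand-in DataFrame below only mimics the bucketed one from
the plan further down, and this may differ from what the JIRA comment
actually proposes):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Stand-in for the intermediate "bucketed" DataFrame from the plan below.
bucketed = spark.createDataFrame([(1, 0.0), (2, 1.0)], ["id_num", "fnlwgt_bucketed"])

# Recreating the DataFrame from its RDD + schema gives the plan fresh
# attribute ids, so the two sides of the join no longer share them.
fresh = spark.createDataFrame(bucketed.rdd, bucketed.schema)

joined = bucketed.alias("a").join(
    fresh.alias("b"),
    col("a.fnlwgt_bucketed") == col("b.fnlwgt_bucketed"),
    "inner",
)
joined.show()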

Thanks!

On Sat, May 26, 2018 at 9:45 PM, Aakash Basu 
wrote:

> You're right.
>
> The same set of queries works for at most 2 columns in the loop.
>
> If I give more than 2 columns, the 2nd column fails with this error -
>
> *attribute(s) with the same name appear in the operation:
> marginal_adhesion_bucketed. Please check if the right attribute(s) are
> used.*
>
> Any idea what may be the reason?
>
> I rechecked the query; its logic is correct.
>
> On Sat, May 26, 2018 at 9:35 PM, hemant singh 
> wrote:
>
>> Per the SQL plan, this is where it is failing -
>>
>> Attribute(s) with the same name appear in the operation: fnlwgt_bucketed. 
>> Please check if the right attribute(s) are used.;
>>
>>
>>
>> On Sat, May 26, 2018 at 6:16 PM, Aakash Basu 
>> wrote:
>>
>>> Hi,
>>>
>>> This query is one step further from the query in this link.
>>> In this scenario, when I add 1 or 2 more columns to be processed, Spark
>>> throws an ERROR and prints the physical plan of the queries.
>>>
>>> It says *Resolved attribute(s) fnlwgt_bucketed#152530 missing*, which is
>>> untrue: if I run the same code on fewer than 3 columns, where this is one of
>>> them, it works like a charm, so I can reasonably assume it is not a bug in
>>> my query or code.
>>>
>>> Is it then an out-of-memory error? My assumption is that, internally, since
>>> there are many registered tables in memory, they are getting dropped due to
>>> an overflow of data. Any insight on this? Has any one of you faced an issue
>>> like this?
>>>
>>> py4j.protocol.Py4JJavaError: An error occurred while calling o21.sql.: 
>>> org.apache.spark.sql.AnalysisException: Resolved attribute(s) 
>>> fnlwgt_bucketed#152530 missing from 
>>> occupation#17,high_income#25,fnlwgt#13,education#14,marital-status#16,relationship#18,workclass#12,sex#20,id_num#10,native_country#24,race#19,education-num#15,hours-per-week#23,age_bucketed#152432,capital-loss#22,age#11,capital-gain#21,fnlwgt_bucketed#99009
>>>  in operator !Project [id_num#10, age#11, workclass#12, fnlwgt#13, 
>>> education#14, education-num#15, marital-status#16, occupation#17, 
>>> relationship#18, race#19, sex#20, capital-gain#21, capital-loss#22, 
>>> hours-per-week#23, native_country#24, high_income#25, age_bucketed#152432, 
>>> fnlwgt_bucketed#152530, if (isnull(cast(hours-per-week#23 as double))) null 
>>> else if (isnull(cast(hours-per-week#23 as double))) null else if 
>>> (isnull(cast(hours-per-week#23 as double))) null else if 
>>> (isnull(cast(hours-per-week#23 as double))) null else if 
>>> (isnull(cast(hours-per-week#23 as double))) null else if 
>>> (isnull(cast(hours-per-week#23 as double))) null else if 
>>> (isnull(cast(hours-per-week#23 as double))) null else if 
>>> (isnull(cast(hours-per-week#23 as double))) null else if 
>>> (isnull(cast(hours-per-week#23 as double))) null else if 
>>> (isnull(cast(hours-per-week#23 as double))) null else if 
>>> (isnull(cast(hours-per-week#23 as double))) null else if 
>>> (isnull(cast(hours-per-week#23 as double))) null else if 
>>> (isnull(cast(hours-per-week#23 as double))) null else 
>>> UDF:bucketizer_0(cast(hours-per-week#23 as double)) AS 
>>> hours-per-week_bucketed#152299]. Attribute(s) with the same name appear in 
>>> the operation: fnlwgt_bucketed. Please check if the right attribute(s) are 
>>> used.;;Project [id_num#10, age#11, workclass#12, fnlwgt#13, education#14, 
>>> education-num#15, marital-status#16, occupation#17, relationship#18, 
>>> race#19, sex#20, capital-gain#21, capital-loss#22, hours-per-week#23, 
>>> native_country#24, high_income#25, age_bucketed#48257, 
>>> fnlwgt_bucketed#99009, hours-per-week_bucketed#152299, 
>>> age_bucketed_WoE#152431, WoE#152524 AS fnlwgt_bucketed_WoE#152529]+- Join 
>>> Inner, (fnlwgt_bucketed#99009 = fnlwgt_bucketed#152530)
>>>:- SubqueryAlias bucketed
>>>:  +- SubqueryAlias a
>>>: +- Project [id_num#10, age#11, workclass#12, fnlwgt#13, 
>>> education#14, education-num#15, marital-status#16, occupation#17, 
>>> relationship#18, race#19, sex#20, capital-gain#21, capital-loss#22, 
>>> hours-per-week#23, native_country#24, high_income#25, age_bucketed#48257, 
>>> fnlwgt_bucketed#99009, hours-per-week_bucketed#152299, WoE#152426 AS 
>>> age_bucketed_WoE#152431]
>>>:+- Join Inner, (age_bucketed#48257 = 

Re: Spark 2.3 Tree Error

2018-05-26 Thread Aakash Basu
You're right.

The same set of queries works for at most 2 columns in the loop.

If I give more than 2 columns, the 2nd column fails with this error -

*attribute(s) with the same name appear in the operation:
marginal_adhesion_bucketed. Please check if the right attribute(s) are
used.*

Any idea what may be the reason?

I rechecked the query; its logic is correct.

On Sat, May 26, 2018 at 9:35 PM, hemant singh  wrote:

> Per the SQL plan, this is where it is failing -
>
> Attribute(s) with the same name appear in the operation: fnlwgt_bucketed. 
> Please check if the right attribute(s) are used.;
>
>
>
> On Sat, May 26, 2018 at 6:16 PM, Aakash Basu 
> wrote:
>
>> Hi,
>>
>> This query is one step further from the query in this link.
>> In this scenario, when I add 1 or 2 more columns to be processed, Spark
>> throws an ERROR and prints the physical plan of the queries.
>>
>> It says *Resolved attribute(s) fnlwgt_bucketed#152530 missing*, which is
>> untrue: if I run the same code on fewer than 3 columns, where this is one of
>> them, it works like a charm, so I can reasonably assume it is not a bug in
>> my query or code.
>>
>> Is it then an out-of-memory error? My assumption is that, internally, since
>> there are many registered tables in memory, they are getting dropped due to
>> an overflow of data. Any insight on this? Has any one of you faced an issue
>> like this?
>>
>> py4j.protocol.Py4JJavaError: An error occurred while calling o21.sql.: 
>> org.apache.spark.sql.AnalysisException: Resolved attribute(s) 
>> fnlwgt_bucketed#152530 missing from 
>> occupation#17,high_income#25,fnlwgt#13,education#14,marital-status#16,relationship#18,workclass#12,sex#20,id_num#10,native_country#24,race#19,education-num#15,hours-per-week#23,age_bucketed#152432,capital-loss#22,age#11,capital-gain#21,fnlwgt_bucketed#99009
>>  in operator !Project [id_num#10, age#11, workclass#12, fnlwgt#13, 
>> education#14, education-num#15, marital-status#16, occupation#17, 
>> relationship#18, race#19, sex#20, capital-gain#21, capital-loss#22, 
>> hours-per-week#23, native_country#24, high_income#25, age_bucketed#152432, 
>> fnlwgt_bucketed#152530, if (isnull(cast(hours-per-week#23 as double))) null 
>> else if (isnull(cast(hours-per-week#23 as double))) null else if 
>> (isnull(cast(hours-per-week#23 as double))) null else if 
>> (isnull(cast(hours-per-week#23 as double))) null else if 
>> (isnull(cast(hours-per-week#23 as double))) null else if 
>> (isnull(cast(hours-per-week#23 as double))) null else if 
>> (isnull(cast(hours-per-week#23 as double))) null else if 
>> (isnull(cast(hours-per-week#23 as double))) null else if 
>> (isnull(cast(hours-per-week#23 as double))) null else if 
>> (isnull(cast(hours-per-week#23 as double))) null else if 
>> (isnull(cast(hours-per-week#23 as double))) null else if 
>> (isnull(cast(hours-per-week#23 as double))) null else if 
>> (isnull(cast(hours-per-week#23 as double))) null else 
>> UDF:bucketizer_0(cast(hours-per-week#23 as double)) AS 
>> hours-per-week_bucketed#152299]. Attribute(s) with the same name appear in 
>> the operation: fnlwgt_bucketed. Please check if the right attribute(s) are 
>> used.;;Project [id_num#10, age#11, workclass#12, fnlwgt#13, education#14, 
>> education-num#15, marital-status#16, occupation#17, relationship#18, 
>> race#19, sex#20, capital-gain#21, capital-loss#22, hours-per-week#23, 
>> native_country#24, high_income#25, age_bucketed#48257, 
>> fnlwgt_bucketed#99009, hours-per-week_bucketed#152299, 
>> age_bucketed_WoE#152431, WoE#152524 AS fnlwgt_bucketed_WoE#152529]+- Join 
>> Inner, (fnlwgt_bucketed#99009 = fnlwgt_bucketed#152530)
>>:- SubqueryAlias bucketed
>>:  +- SubqueryAlias a
>>: +- Project [id_num#10, age#11, workclass#12, fnlwgt#13, 
>> education#14, education-num#15, marital-status#16, occupation#17, 
>> relationship#18, race#19, sex#20, capital-gain#21, capital-loss#22, 
>> hours-per-week#23, native_country#24, high_income#25, age_bucketed#48257, 
>> fnlwgt_bucketed#99009, hours-per-week_bucketed#152299, WoE#152426 AS 
>> age_bucketed_WoE#152431]
>>:+- Join Inner, (age_bucketed#48257 = age_bucketed#152432)
>>:   :- SubqueryAlias bucketed
>>:   :  +- SubqueryAlias a
>>:   : +- Project [id_num#10, age#11, workclass#12, fnlwgt#13, 
>> education#14, education-num#15, marital-status#16, occupation#17, 
>> relationship#18, race#19, sex#20, capital-gain#21, capital-loss#22, 
>> hours-per-week#23, native_country#24, high_income#25, age_bucketed#48257, 
>> fnlwgt_bucketed#99009, if (isnull(cast(hours-per-week#23 as double))) null 
>> else if (isnull(cast(hours-per-week#23 as double))) null else if 
>> (isnull(cast(hours-per-week#23 as double))) null else 
>> UDF:bucketizer_0(cast(hours-per-week#23 as double)) AS 

Re: Spark 2.3 Tree Error

2018-05-26 Thread hemant singh
Per the SQL plan, this is where it is failing -

Attribute(s) with the same name appear in the operation:
fnlwgt_bucketed. Please check if the right attribute(s) are used.;



On Sat, May 26, 2018 at 6:16 PM, Aakash Basu 
wrote:

> Hi,
>
> This query is one step further from the query in this link.
> In this scenario, when I add 1 or 2 more columns to be processed, Spark
> throws an ERROR and prints the physical plan of the queries.
>
> It says *Resolved attribute(s) fnlwgt_bucketed#152530 missing*, which is
> untrue: if I run the same code on fewer than 3 columns, where this is one of
> them, it works like a charm, so I can reasonably assume it is not a bug in
> my query or code.
>
> Is it then an out-of-memory error? My assumption is that, internally, since
> there are many registered tables in memory, they are getting dropped due to
> an overflow of data. Any insight on this? Has any one of you faced an issue
> like this?
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o21.sql.: 
> org.apache.spark.sql.AnalysisException: Resolved attribute(s) 
> fnlwgt_bucketed#152530 missing from 
> occupation#17,high_income#25,fnlwgt#13,education#14,marital-status#16,relationship#18,workclass#12,sex#20,id_num#10,native_country#24,race#19,education-num#15,hours-per-week#23,age_bucketed#152432,capital-loss#22,age#11,capital-gain#21,fnlwgt_bucketed#99009
>  in operator !Project [id_num#10, age#11, workclass#12, fnlwgt#13, 
> education#14, education-num#15, marital-status#16, occupation#17, 
> relationship#18, race#19, sex#20, capital-gain#21, capital-loss#22, 
> hours-per-week#23, native_country#24, high_income#25, age_bucketed#152432, 
> fnlwgt_bucketed#152530, if (isnull(cast(hours-per-week#23 as double))) null 
> else if (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else 
> UDF:bucketizer_0(cast(hours-per-week#23 as double)) AS 
> hours-per-week_bucketed#152299]. Attribute(s) with the same name appear in 
> the operation: fnlwgt_bucketed. Please check if the right attribute(s) are 
> used.;;Project [id_num#10, age#11, workclass#12, fnlwgt#13, education#14, 
> education-num#15, marital-status#16, occupation#17, relationship#18, race#19, 
> sex#20, capital-gain#21, capital-loss#22, hours-per-week#23, 
> native_country#24, high_income#25, age_bucketed#48257, fnlwgt_bucketed#99009, 
> hours-per-week_bucketed#152299, age_bucketed_WoE#152431, WoE#152524 AS 
> fnlwgt_bucketed_WoE#152529]+- Join Inner, (fnlwgt_bucketed#99009 = 
> fnlwgt_bucketed#152530)
>:- SubqueryAlias bucketed
>:  +- SubqueryAlias a
>: +- Project [id_num#10, age#11, workclass#12, fnlwgt#13, 
> education#14, education-num#15, marital-status#16, occupation#17, 
> relationship#18, race#19, sex#20, capital-gain#21, capital-loss#22, 
> hours-per-week#23, native_country#24, high_income#25, age_bucketed#48257, 
> fnlwgt_bucketed#99009, hours-per-week_bucketed#152299, WoE#152426 AS 
> age_bucketed_WoE#152431]
>:+- Join Inner, (age_bucketed#48257 = age_bucketed#152432)
>:   :- SubqueryAlias bucketed
>:   :  +- SubqueryAlias a
>:   : +- Project [id_num#10, age#11, workclass#12, fnlwgt#13, 
> education#14, education-num#15, marital-status#16, occupation#17, 
> relationship#18, race#19, sex#20, capital-gain#21, capital-loss#22, 
> hours-per-week#23, native_country#24, high_income#25, age_bucketed#48257, 
> fnlwgt_bucketed#99009, if (isnull(cast(hours-per-week#23 as double))) null 
> else if (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else 
> UDF:bucketizer_0(cast(hours-per-week#23 as double)) AS 
> hours-per-week_bucketed#152299]
>:   :+- Project [id_num#10, age#11, workclass#12, 
> fnlwgt#13, education#14, education-num#15, marital-status#16, occupation#17, 
> relationship#18, race#19, sex#20, capital-gain#21, capital-loss#22, 
> hours-per-week#23, native_country#24, high_income#25, age_bucketed#48257, if 
> (isnull(cast(fnlwgt#13 as double))) null else if (isnull(cast(fnlwgt#13 as 
> double))) null else if (isnull(cast(fnlwgt#13 as double))) null else 
> UDF:bucketizer_0(cast(fnlwgt#13 as double)) 

Re: Silly question on Dropping Temp Table

2018-05-26 Thread Aakash Basu
Well, it did work, which means that internally a TempTable and a TempView are the same.

Thanks buddy!
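
For the archives, a minimal sketch of the 2.x pattern (the view name is
arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# registerTempTable() is a deprecated alias for createOrReplaceTempView()
# in Spark 2.x, so either registration can be dropped through the catalog.
df.createOrReplaceTempView("xyz")
spark.catalog.dropTempView("xyz")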

On Sat, May 26, 2018 at 9:23 PM, Aakash Basu 
wrote:

> The question is: if I register with registerTempTable() and drop with
> dropTempView(), would it hit the same TempTable internally, or would it
> search for a registered view? Not sure. Any idea?
>
> On Sat, May 26, 2018 at 9:04 PM, SNEHASISH DUTTA  wrote:
>
>> I think it's dropTempView
>>
>> On Sat, May 26, 2018, 8:56 PM Aakash Basu 
>> wrote:
>>
>>> Hi all,
>>>
>>> I'm trying to use dropTempTable() after the respective Temporary Table's
>>> use is over (to free up the memory for the next calculations).
>>>
>>> The newer SparkSession doesn't need a sqlContext, so I am confused about
>>> how to use the function.
>>>
>>> 1) Tried it on the same DF which I used to register the temp table -
>>>
>>> DF.dropTempTable('xyz')
>>>
>>> Didn't work.
>>>
>>> 2) Tried the following way too, since Spark internally exposes a sqlContext
>>> along with the sparkContext, but it didn't work -
>>>
>>> spark.dropTempTable('xyz')
>>>
>>> 3) Tried spark.catalog to drop; this failed too -
>>>
>>> spark.catalog.dropTempTable('xyz')
>>>
>>>
>>> What should I do? The 1.6 examples on the internet do not work for
>>> dropTempTable() in version 2.3.
>>>
>>> Thanks,
>>> Aakash.
>>>
>>
>


Re: Silly question on Dropping Temp Table

2018-05-26 Thread Aakash Basu
The question is: if I register with registerTempTable() and drop with
dropTempView(), would it hit the same TempTable internally, or would it
search for a registered view? Not sure. Any idea?

On Sat, May 26, 2018 at 9:04 PM, SNEHASISH DUTTA 
wrote:

> I think it's dropTempView
>
> On Sat, May 26, 2018, 8:56 PM Aakash Basu 
> wrote:
>
>> Hi all,
>>
>> I'm trying to use dropTempTable() after the respective Temporary Table's
>> use is over (to free up the memory for the next calculations).
>>
>> The newer SparkSession doesn't need a sqlContext, so I am confused about
>> how to use the function.
>>
>> 1) Tried it on the same DF which I used to register the temp table -
>>
>> DF.dropTempTable('xyz')
>>
>> Didn't work.
>>
>> 2) Tried the following way too, since Spark internally exposes a sqlContext
>> along with the sparkContext, but it didn't work -
>>
>> spark.dropTempTable('xyz')
>>
>> 3) Tried spark.catalog to drop; this failed too -
>>
>> spark.catalog.dropTempTable('xyz')
>>
>>
>> What should I do? The 1.6 examples on the internet do not work for
>> dropTempTable() in version 2.3.
>>
>> Thanks,
>> Aakash.
>>
>


Silly question on Dropping Temp Table

2018-05-26 Thread Aakash Basu
Hi all,

I'm trying to use dropTempTable() after the respective Temporary Table's
use is over (to free up the memory for the next calculations).

The newer SparkSession doesn't need a sqlContext, so I am confused about
how to use the function.

1) Tried it on the same DF which I used to register the temp table -

DF.dropTempTable('xyz')

Didn't work.

2) Tried the following way too, since Spark internally exposes a sqlContext
along with the sparkContext, but it didn't work -

spark.dropTempTable('xyz')

3) Tried spark.catalog to drop; this failed too -

spark.catalog.dropTempTable('xyz')


What should I do? The 1.6 examples on the internet do not work for
dropTempTable() in version 2.3.

Thanks,
Aakash.


Spark 2.3 Tree Error

2018-05-26 Thread Aakash Basu
Hi,

This query is one step further from the query in this link.
In this scenario, when I add 1 or 2 more columns to be processed, Spark
throws an ERROR and prints the physical plan of the queries.

It says *Resolved attribute(s) fnlwgt_bucketed#152530 missing*, which is
untrue: if I run the same code on fewer than 3 columns, where this is one of
them, it works like a charm, so I can reasonably assume it is not a bug in
my query or code.

Is it then an out-of-memory error? My assumption is that, internally, since
there are many registered tables in memory, they are getting dropped due to
an overflow of data. Any insight on this? Has any one of you faced an issue
like this?

py4j.protocol.Py4JJavaError: An error occurred while calling o21.sql.:
org.apache.spark.sql.AnalysisException: Resolved attribute(s)
fnlwgt_bucketed#152530 missing from
occupation#17,high_income#25,fnlwgt#13,education#14,marital-status#16,relationship#18,workclass#12,sex#20,id_num#10,native_country#24,race#19,education-num#15,hours-per-week#23,age_bucketed#152432,capital-loss#22,age#11,capital-gain#21,fnlwgt_bucketed#99009
in operator !Project [id_num#10, age#11, workclass#12, fnlwgt#13,
education#14, education-num#15, marital-status#16, occupation#17,
relationship#18, race#19, sex#20, capital-gain#21, capital-loss#22,
hours-per-week#23, native_country#24, high_income#25,
age_bucketed#152432, fnlwgt_bucketed#152530, if
(isnull(cast(hours-per-week#23 as double))) null else if
(isnull(cast(hours-per-week#23 as double))) null else if
(isnull(cast(hours-per-week#23 as double))) null else if
(isnull(cast(hours-per-week#23 as double))) null else if
(isnull(cast(hours-per-week#23 as double))) null else if
(isnull(cast(hours-per-week#23 as double))) null else if
(isnull(cast(hours-per-week#23 as double))) null else if
(isnull(cast(hours-per-week#23 as double))) null else if
(isnull(cast(hours-per-week#23 as double))) null else if
(isnull(cast(hours-per-week#23 as double))) null else if
(isnull(cast(hours-per-week#23 as double))) null else if
(isnull(cast(hours-per-week#23 as double))) null else if
(isnull(cast(hours-per-week#23 as double))) null else
UDF:bucketizer_0(cast(hours-per-week#23 as double)) AS
hours-per-week_bucketed#152299]. Attribute(s) with the same name
appear in the operation: fnlwgt_bucketed. Please check if the right
attribute(s) are used.;;Project [id_num#10, age#11, workclass#12,
fnlwgt#13, education#14, education-num#15, marital-status#16,
occupation#17, relationship#18, race#19, sex#20, capital-gain#21,
capital-loss#22, hours-per-week#23, native_country#24, high_income#25,
age_bucketed#48257, fnlwgt_bucketed#99009,
hours-per-week_bucketed#152299, age_bucketed_WoE#152431, WoE#152524 AS
fnlwgt_bucketed_WoE#152529]+- Join Inner, (fnlwgt_bucketed#99009 =
fnlwgt_bucketed#152530)
   :- SubqueryAlias bucketed
   :  +- SubqueryAlias a
   : +- Project [id_num#10, age#11, workclass#12, fnlwgt#13,
education#14, education-num#15, marital-status#16, occupation#17,
relationship#18, race#19, sex#20, capital-gain#21, capital-loss#22,
hours-per-week#23, native_country#24, high_income#25,
age_bucketed#48257, fnlwgt_bucketed#99009,
hours-per-week_bucketed#152299, WoE#152426 AS age_bucketed_WoE#152431]
   :+- Join Inner, (age_bucketed#48257 = age_bucketed#152432)
   :   :- SubqueryAlias bucketed
   :   :  +- SubqueryAlias a
   :   : +- Project [id_num#10, age#11, workclass#12,
fnlwgt#13, education#14, education-num#15, marital-status#16,
occupation#17, relationship#18, race#19, sex#20, capital-gain#21,
capital-loss#22, hours-per-week#23, native_country#24, high_income#25,
age_bucketed#48257, fnlwgt_bucketed#99009, if
(isnull(cast(hours-per-week#23 as double))) null else if
(isnull(cast(hours-per-week#23 as double))) null else if
(isnull(cast(hours-per-week#23 as double))) null else
UDF:bucketizer_0(cast(hours-per-week#23 as double)) AS
hours-per-week_bucketed#152299]
   :   :+- Project [id_num#10, age#11, workclass#12,
fnlwgt#13, education#14, education-num#15, marital-status#16,
occupation#17, relationship#18, race#19, sex#20, capital-gain#21,
capital-loss#22, hours-per-week#23, native_country#24, high_income#25,
age_bucketed#48257, if (isnull(cast(fnlwgt#13 as double))) null else
if (isnull(cast(fnlwgt#13 as double))) null else if
(isnull(cast(fnlwgt#13 as double))) null else
UDF:bucketizer_0(cast(fnlwgt#13 as double)) AS fnlwgt_bucketed#99009]
   :   :   +- Project [id_num#10, age#11,
workclass#12, fnlwgt#13, education#14, education-num#15,
marital-status#16, occupation#17, relationship#18, race#19, sex#20,
capital-gain#21, capital-loss#22, hours-per-week#23,
native_country#24, high_income#25, if (isnull(cast(age#11 as double)))
null else if (isnull(cast(age#11 as double))) null else if
(isnull(cast(age#11 as double))) null else

Spark 2.3 Memory Leak on Executor

2018-05-26 Thread Aakash Basu
Hi,

I am getting a managed memory leak warning, which was supposedly a Spark bug
up to version 1.6 and was resolved.

Mode: Standalone
IDE: PyCharm
Spark version: 2.3
Python version: 3.6

Below is the stack trace -

2018-05-25 15:00:05 WARN  Executor:66 - Managed memory leak detected; size = 262144 bytes, TID = 3148
2018-05-25 15:00:05 WARN  Executor:66 - Managed memory leak detected; size = 262144 bytes, TID = 3152
2018-05-25 15:00:05 WARN  Executor:66 - Managed memory leak detected; size = 262144 bytes, TID = 3151
2018-05-25 15:00:05 WARN  Executor:66 - Managed memory leak detected; size = 262144 bytes, TID = 3150
2018-05-25 15:00:05 WARN  Executor:66 - Managed memory leak detected; size = 262144 bytes, TID = 3149
2018-05-25 15:00:05 WARN  Executor:66 - Managed memory leak detected; size = 262144 bytes, TID = 3153
2018-05-25 15:00:05 WARN  Executor:66 - Managed memory leak detected; size = 262144 bytes, TID = 3154
2018-05-25 15:00:05 WARN  Executor:66 - Managed memory leak detected; size = 262144 bytes, TID = 3158
2018-05-25 15:00:05 WARN  Executor:66 - Managed memory leak detected; size = 262144 bytes, TID = 3155
2018-05-25 15:00:05 WARN  Executor:66 - Managed memory leak detected; size = 262144 bytes, TID = 3157
2018-05-25 15:00:05 WARN  Executor:66 - Managed memory leak detected; size = 262144 bytes, TID = 3160
2018-05-25 15:00:05 WARN  Executor:66 - Managed memory leak detected; size = 262144 bytes, TID = 3161
2018-05-25 15:00:05 WARN  Executor:66 - Managed memory leak detected; size = 262144 bytes, TID = 3156
2018-05-25 15:00:05 WARN  Executor:66 - Managed memory leak detected; size = 262144 bytes, TID = 3159
2018-05-25 15:00:05 WARN  Executor:66 - Managed memory leak detected; size = 262144 bytes, TID = 3165
2018-05-25 15:00:05 WARN  Executor:66 - Managed memory leak detected; size = 262144 bytes, TID = 3163
2018-05-25 15:00:05 WARN  Executor:66 - Managed memory leak detected; size = 262144 bytes, TID = 3162
2018-05-25 15:00:05 WARN  Executor:66 - Managed memory leak detected; size = 262144 bytes, TID = 3166

Any insight on why this may happen? The job itself completes successfully,
though.

I've posted the query on StackOverflow too.

P.S. - No connection to a database is kept open (as per a comment there).

Thanks,
Aakash.


what defines dataset partition number in spark sql

2018-05-26 Thread ε΄”θ‹—
Hi,
I want to know: when I create a Dataset by reading files from HDFS in Spark SQL,
like Dataset user = spark.read().format("json").load(filePath), what
determines the number of partitions of the Dataset?
And what if filePath is a directory instead of a single file?
Why can't we get the number of partitions of a Dataset with
dataset.getNumPartitions()? Why must we convert the Dataset to an RDD to get
the partition count: dataset.rdd().getNumPartitions()?
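
For illustration, this is roughly how I am checking it today in PySpark (the
path is a placeholder), and I assume the split count is related to the
file-source settings noted in the comments:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Works for a single file or a whole directory of files.
df = spark.read.format("json").load("/path/to/json")   # placeholder path

# Dataset/DataFrame has no getNumPartitions(); partitioning is a property of
# the underlying RDD, so the count is read through df.rdd.
print(df.rdd.getNumPartitions())

# For file-based sources the number of input splits is driven mainly by total
# input size, file count, default parallelism, and these settings (defaults):
#   spark.sql.files.maxPartitionBytes = 134217728  (128 MB)
#   spark.sql.files.openCostInBytes   = 4194304    (4 MB)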


Thanks