Re: Spark UI crashes on Large Workloads

2017-07-18 Thread Saatvik Shah
Hi Riccardo,

Thanks for your suggestions.
The thing is that only the Spark UI is crashing, not the app. In fact, the app
does end up completing successfully.
That's why I'm a bit confused by this issue.
I'll still try out some of your suggestions.
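For reference, a minimal sketch (assuming Spark 2.x property names; the app name
and event-log directory are placeholders) of the event-log settings that let the
History Server mentioned below rebuild the UI after the app finishes:

from pyspark.sql import SparkSession

# Placeholder app name and event-log directory; point them at your cluster.
spark = (SparkSession.builder
         .appName("ui-crash-debug")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "hdfs:///tmp/spark-events")
         .getOrCreate())

# Afterwards, start $SPARK_HOME/sbin/start-history-server.sh with
# spark.history.fs.logDirectory pointing at the same directory.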
Thanks and Regards,
Saatvik Shah


On Tue, Jul 18, 2017 at 9:59 AM, Riccardo Ferrari <ferra...@gmail.com>
wrote:

> The reason you get connection refused when connecting to the application
> UI (port 4040) is because you app gets stopped thus the application UI
> stops as well. To inspect your executors logs after the fact you might find
> useful the Spark History server
> <https://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact>
> (for standalone mode).
>
> Personally I collect the logs from my worker nodes. They generally sit
> under $SPARK_HOME/work// (for standalone).
> There you can find exceptions and messages from the executors assigned to
> your app.
>
> Now, about your app crashing: it might be useful to check whether it is sized
> correctly. The issue you linked sounds relevant, but I would give some sanity
> checks a try first. I have solved many issues just by sizing the app properly:
> checking memory size, CPU allocation, and so on.
>
> Best,
>
> On Tue, Jul 18, 2017 at 3:30 PM, Saatvik Shah <saatvikshah1...@gmail.com>
> wrote:
>
>> Hi Riccardo,
>>
>> Yes, thanks for suggesting I do that.
>>
>> [Stage 1:==>   (12750 + 40) / 15000]
>> 17/07/18 13:22:28 ERROR org.apache.spark.scheduler.LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
>> 17/07/18 13:22:28 WARN org.apache.spark.scheduler.LiveListenerBus: Dropped 1 SparkListenerEvents since Thu Jan 01 00:00:00 UTC 1970
>> [Stage 1:> (13320 + 41) / 15000]
>> 17/07/18 13:23:28 WARN org.apache.spark.scheduler.LiveListenerBus: Dropped 26782 SparkListenerEvents since Tue Jul 18 13:22:28 UTC 2017
>> [Stage 1:==>   (13867 + 40) / 15000]
>> 17/07/18 13:24:28 WARN org.apache.spark.scheduler.LiveListenerBus: Dropped 58751 SparkListenerEvents since Tue Jul 18 13:23:28 UTC 2017
>> [Stage 1:===>  (14277 + 40) / 15000]
>> 17/07/18 13:25:10 INFO org.spark_project.jetty.server.ServerConnector: Stopped ServerConnector@3b7284c4{HTTP/1.1}{0.0.0.0:4040}
>> 17/07/18 13:25:10 ERROR org.apache.spark.scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(4,WrappedArray())
>> And similar WARN/INFO messages continue occurring.
>>
>> When I try to access the UI, I get:
>>
>> Problem accessing /proxy/application_1500380353993_0001/. Reason:
>>
>> Connection to http://10.142.0.17:4040 refused
>>
>> Caused by:
>>
>> org.apache.http.conn.HttpHostConnectException: Connection to http://10.142.0.17:4040 refused
>>  at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:190)
>>  at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
>>  at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:643)
>>  at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
>>  at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
>>  at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
>>  at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
>>  at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:200)
>>  at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:387)
>>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>>
>>
>>
>> I noticed this issue talks about something similar and I guess is
>> related: https://issues.apache.org/jira/browse/SPARK-18838.
>>
>> On Tue, Jul 18, 2017 at 2:49 AM, Riccardo Ferrari <ferra...@gmail.com>
>> wrote:
>>
>>> Hi,

Re: Spark UI crashes on Large Workloads

2017-07-18 Thread Saatvik Shah
Hi Riccardo,

Yes, thanks for suggesting I do that.

[Stage 1:==>   (12750 + 40) / 15000]
17/07/18 13:22:28 ERROR org.apache.spark.scheduler.LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
17/07/18 13:22:28 WARN org.apache.spark.scheduler.LiveListenerBus: Dropped 1 SparkListenerEvents since Thu Jan 01 00:00:00 UTC 1970
[Stage 1:> (13320 + 41) / 15000]
17/07/18 13:23:28 WARN org.apache.spark.scheduler.LiveListenerBus: Dropped 26782 SparkListenerEvents since Tue Jul 18 13:22:28 UTC 2017
[Stage 1:==>   (13867 + 40) / 15000]
17/07/18 13:24:28 WARN org.apache.spark.scheduler.LiveListenerBus: Dropped 58751 SparkListenerEvents since Tue Jul 18 13:23:28 UTC 2017
[Stage 1:===>  (14277 + 40) / 15000]
17/07/18 13:25:10 INFO org.spark_project.jetty.server.ServerConnector: Stopped ServerConnector@3b7284c4{HTTP/1.1}{0.0.0.0:4040}
17/07/18 13:25:10 ERROR org.apache.spark.scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(4,WrappedArray())
And similar WARN/INFO messages continue occurring.

When I try to access the UI, I get:

Problem accessing /proxy/application_1500380353993_0001/. Reason:

Connection to http://10.142.0.17:4040 refused

Caused by:

org.apache.http.conn.HttpHostConnectException: Connection to http://10.142.0.17:4040 refused
 at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:190)
 at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
 at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:643)
 at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
 at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
 at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
 at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
 at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:200)
 at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:387)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)



I noticed this issue talks about something similar and I guess is related:
https://issues.apache.org/jira/browse/SPARK-18838.
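If the listener event queue filling up is the cause, one commonly suggested
mitigation is to enlarge the queue; a minimal sketch (the property name assumes
Spark 2.0-2.2, where the default is 10000; later releases call it
spark.scheduler.listenerbus.eventqueue.capacity):

from pyspark import SparkConf, SparkContext

# Enlarged listener-bus queue so the UI listener is less likely to drop events.
# Property name assumed for Spark <= 2.2; the value here is illustrative.
conf = (SparkConf()
        .setAppName("large-workload")
        .set("spark.scheduler.listenerbus.eventqueue.size", "100000"))
sc = SparkContext(conf=conf)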

On Tue, Jul 18, 2017 at 2:49 AM, Riccardo Ferrari <ferra...@gmail.com>
wrote:

> Hi,
> Can you share more details? Do you have any exceptions from the driver
> or the executors?
>
> best,
>
> On Jul 18, 2017 02:49, "saatvikshah1994" <saatvikshah1...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have a PySpark app which, when provided a huge amount of data as input,
>> sometimes throws the error explained here:
>> https://stackoverflow.com/questions/32340639/unable-to-understand-error-sparklistenerbus-has-already-stopped-dropping-event.
>> All my code runs inside the main function, and the only slightly
>> peculiar thing I am doing in this app is using a custom PySpark ML
>> Transformer (modified from
>> https://stackoverflow.com/questions/32331848/create-a-custom-transformer-in-pyspark-ml).
>> Could this be the issue? How can I debug why this is happening?
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Spark-UI-crashes-on-Large-Workloads-tp28873.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


-- 
*Saatvik Shah,*
*Masters in the School of Computer Science,*
*Carnegie Mellon University,*
*LinkedIn <https://www.linkedin.com/in/saatvikshah/>, Website
<https://saatvikshah1994.github.io/>*


Re: PySpark working with Generators

2017-07-05 Thread Saatvik Shah
Hi Jörn,

I apologize for such a late response.

Yes, the data volume is very high (it won't fit in one machine's memory), and I am
getting a significant benefit from reading the files in a distributed
manner.
Since the data volume is high, converting it to an alternative format would
be a worst-case scenario.
I agree on writing a custom Spark data source, but that might take a while, and
to proceed with the work until then I was hoping to use the current
implementation itself, which is fast enough to work with. The only issue is
the one I've already discussed, which is working with generators to
allow low-memory executor tasks.
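Concretely, the idea is roughly the sketch below ("lazy" is a hypothetical mode
name standing in for whatever flag makes foo return a generator rather than a
full list of lists):

def read_lazily(path):
    # Yield one row at a time so a whole file never sits in executor
    # memory at once; foo's generator mode name here is hypothetical.
    for row in foo(path, "lazy"):
        yield row

# PySpark pickles the function, not a generator object, and flatMap then
# consumes the yielded rows lazily on each executor.
rdd = file_paths_rdd.flatMap(read_lazily)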
I'm not sure I fully understand your recommendation on the core usage -
could you explain it in a little more detail? I'm currently using dynamic
allocation with YARN, allowing each Spark executor 8 vcores.
The data format is proprietary and surely not one you'd have heard of.

Thanks and Regards,
Saatvik Shah


On Fri, Jun 30, 2017 at 10:16 AM, Jörn Franke <jornfra...@gmail.com> wrote:

> In this case I do not see so many benefits to using Spark. Is the data
> volume high?
> Alternatively, I recommend converting the proprietary format into a format
> Spark understands and then using that format in Spark.
> Another alternative would be to write a custom Spark data source. Even your
> proprietary format should then be able to be put on HDFS.
> That being said, I do not recommend using more cores outside Spark's
> control. The reason is that Spark thinks these cores are free and makes the
> wrong allocation of executors/tasks. This will slow down all applications
> on Spark.
>
> May I ask what the format is called?
>
> On 30. Jun 2017, at 16:05, Saatvik Shah <saatvikshah1...@gmail.com> wrote:
>
> Hi Mahesh and Ayan,
>
> The files I'm working with are in a very complex proprietary format, for which
> I only have access to a reader function, as I described earlier, which
> only accepts a path on a local file system.
> This rules out sc.wholeTextFile, since I cannot pass the contents returned by
> wholeTextFile to a function (API call) expecting a local file path.
> For similar reasons, I cannot use HDFS and am currently bound to a highly
> available Network File System arrangement.
> Any suggestions, given these constraints? Or any incorrect assumptions
> you think I've made?
>
> Thanks and Regards,
> Saatvik Shah
>
>
>
> On Fri, Jun 30, 2017 at 12:50 AM, Mahesh Sawaiker <
> mahesh_sawai...@persistent.com> wrote:
>
>> Wouldn’t this work if you load the files into HDFS and set the number of
>> partitions equal to the amount of parallelism you want?
>>
>>
>>
>> *From:* Saatvik Shah [mailto:saatvikshah1...@gmail.com]
>> *Sent:* Friday, June 30, 2017 8:55 AM
>> *To:* ayan guha
>> *Cc:* user
>> *Subject:* Re: PySpark working with Generators
>>
>>
>>
>> Hey Ayan,
>>
>>
>>
>> This isn't a typical text file; it's a proprietary data format for which a
>> native Spark reader is not available.
>>
>>
>>
>> Thanks and Regards,
>>
>> Saatvik Shah
>>
>>
>>
>> On Thu, Jun 29, 2017 at 6:48 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>> If your files are in same location you can use sc.wholeTextFile. If not,
>> sc.textFile accepts a list of filepaths.
>>
>>
>>
>> On Fri, 30 Jun 2017 at 5:59 am, saatvikshah1994 <
>> saatvikshah1...@gmail.com> wrote:
>>
>> Hi,
>>
>> I have a file-reading function called /foo/ which reads a file's contents
>> into either a list of lists or a generator of lists of lists representing the
>> same file.
>>
>> When reading as a complete chunk (one record array) I do something like:
>> rdd = file_paths_rdd.map(lambda x: foo(x, "wholeFile")).flatMap(lambda x: x)
>>
>> I'd like to now do something similar but with the generator, so that I can
>> work with more cores and lower memory. I'm not sure how to tackle this,
>> since generators cannot be pickled and thus I'm not sure how to distribute
>> the work of reading each file_path over the RDD.
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/PySpark-working-with-Generators-tp28810.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>> --
>>
>> Best Regards,
>> Ayan Guha
>>
>>

Re: PySpark working with Generators

2017-06-30 Thread Saatvik Shah
Hi Mahesh and Ayan,

The files I'm working with are in a very complex proprietary format, for which
I only have access to a reader function, as I described earlier, which
only accepts a path on a local file system.
This rules out sc.wholeTextFile, since I cannot pass the contents returned by
wholeTextFile to a function (API call) expecting a local file path.
For similar reasons, I cannot use HDFS and am currently bound to a highly
available Network File System arrangement.
Any suggestions, given these constraints? Or any incorrect assumptions
you think I've made?
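For concreteness, a minimal sketch of the current arrangement (reader_api.read
is a stand-in name for the proprietary reader, and it is assumed every executor
sees the same NFS mount):

# paths is a plain Python list of NFS file paths visible on every node.
paths_rdd = sc.parallelize(paths, numSlices=len(paths))

def read_partition(path_iter):
    # Each executor calls the proprietary reader (stand-in name
    # reader_api.read) against its local view of the shared NFS mount.
    for path in path_iter:
        for row in reader_api.read(path):
            yield row

rows_rdd = paths_rdd.mapPartitions(read_partition)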

Thanks and Regards,
Saatvik Shah



On Fri, Jun 30, 2017 at 12:50 AM, Mahesh Sawaiker <
mahesh_sawai...@persistent.com> wrote:

> Wouldn’t this work if you load the files into HDFS and set the number of
> partitions equal to the amount of parallelism you want?
>
>
>
> *From:* Saatvik Shah [mailto:saatvikshah1...@gmail.com]
> *Sent:* Friday, June 30, 2017 8:55 AM
> *To:* ayan guha
> *Cc:* user
> *Subject:* Re: PySpark working with Generators
>
>
>
> Hey Ayan,
>
>
>
> This isn't a typical text file; it's a proprietary data format for which a
> native Spark reader is not available.
>
>
>
> Thanks and Regards,
>
> Saatvik Shah
>
>
>
> On Thu, Jun 29, 2017 at 6:48 PM, ayan guha <guha.a...@gmail.com> wrote:
>
> If your files are in same location you can use sc.wholeTextFile. If not,
> sc.textFile accepts a list of filepaths.
>
>
>
> On Fri, 30 Jun 2017 at 5:59 am, saatvikshah1994 <saatvikshah1...@gmail.com>
> wrote:
>
> Hi,
>
> I have a file-reading function called /foo/ which reads a file's contents into
> either a list of lists or a generator of lists of lists representing the same
> file.
>
> When reading as a complete chunk (one record array) I do something like:
> rdd = file_paths_rdd.map(lambda x: foo(x, "wholeFile")).flatMap(lambda x: x)
>
> I'd like to now do something similar but with the generator, so that I can
> work with more cores and lower memory. I'm not sure how to tackle this,
> since generators cannot be pickled and thus I'm not sure how to distribute
> the work of reading each file_path over the RDD.
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/PySpark-working-with-Generators-tp28810.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
>
> Best Regards,
> Ayan Guha
>
>
> DISCLAIMER
> ==
> This e-mail may contain privileged and confidential information which is
> the property of Persistent Systems Ltd. It is intended only for the use of
> the individual or entity to which it is addressed. If you are not the
> intended recipient, you are not authorized to read, retain, copy, print,
> distribute or use this message. If you have received this communication in
> error, please notify the sender and delete all copies of this message.
> Persistent Systems Ltd. does not accept any liability for virus infected
> mails.
>


Re: PySpark working with Generators

2017-06-29 Thread Saatvik Shah
Hey Ayan,

This isn't a typical text file; it's a proprietary data format for which a
native Spark reader is not available.

Thanks and Regards,
Saatvik Shah

On Thu, Jun 29, 2017 at 6:48 PM, ayan guha <guha.a...@gmail.com> wrote:

> If your files are in same location you can use sc.wholeTextFile. If not,
> sc.textFile accepts a list of filepaths.
>
> On Fri, 30 Jun 2017 at 5:59 am, saatvikshah1994 <saatvikshah1...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have a file-reading function called /foo/ which reads a file's contents
>> into either a list of lists or a generator of lists of lists representing the
>> same file.
>>
>> When reading as a complete chunk (one record array) I do something like:
>> rdd = file_paths_rdd.map(lambda x: foo(x, "wholeFile")).flatMap(lambda x: x)
>>
>> I'd like to now do something similar but with the generator, so that I can
>> work with more cores and lower memory. I'm not sure how to tackle this,
>> since generators cannot be pickled and thus I'm not sure how to distribute
>> the work of reading each file_path over the RDD.
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/PySpark-working-with-Generators-tp28810.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>> --
> Best Regards,
> Ayan Guha
>


Re: Merging multiple Pandas dataframes

2017-06-22 Thread Saatvik Shah
Hi Assaf,

Thanks for your suggestion.

I also found one other improvement, which is to iteratively convert the Pandas
DFs to RDDs and take a union of those (similar to what I was doing with
DataFrames). Calling createDataFrame is heavy, and checkpointing of DataFrames
is a brand new feature. Instead, I create one big union of RDDs and only apply
createDataFrame at the very end.
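A minimal sketch of that pattern (pandas_dfs and schema stand in for the actual
iterable of same-schema Pandas DataFrames and the shared Spark schema):

# Convert each Pandas DataFrame to an RDD of plain rows, union the RDDs,
# and build a Spark DataFrame only once at the very end.
rdds = [sc.parallelize(pdf.values.tolist()) for pdf in pandas_dfs]
combined = sc.union(rdds)
spark_df = spark.createDataFrame(combined, schema=schema)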

Thanks and Regards,
Saatvik

On Wed, Jun 21, 2017 at 2:03 AM, Mendelson, Assaf <assaf.mendel...@rsa.com>
wrote:

> If you do an action, most intermediate calculations would be gone for the
> next iteration.
>
> What I would do is persist every iteration, then after some iterations (say 5)
> I would write to disk and reload. At that point you should call unpersist to
> free the memory, as it is no longer relevant.
>
>
>
> Thanks,
>
>       Assaf.
>
>
>
> *From:* Saatvik Shah [mailto:saatvikshah1...@gmail.com]
> *Sent:* Tuesday, June 20, 2017 8:50 PM
> *To:* Mendelson, Assaf
> *Cc:* user@spark.apache.org
> *Subject:* Re: Merging multiple Pandas dataframes
>
>
>
> Hi Assaf,
>
> Thanks for the suggestion on checkpointing - I'll need to read up more on
> that.
>
> My current implementation seems to be crashing with a GC memory limit
> exceeded error if I'm keeping multiple persist calls for a large number of
> files.
>
>
>
> Thus, I was also thinking about the constant calls to persist. Since all
> my operations are Spark transformations (a union of a large number of Spark
> DataFrames built from Pandas dataframes), this entire process of building a
> large Spark dataframe is essentially one huge transformation. Is it necessary
> to call persist between unions? Shouldn't I instead wait for all the unions to
> complete and call persist once at the end?
>
>
>
>
>
> On Tue, Jun 20, 2017 at 2:52 AM, Mendelson, Assaf <assaf.mendel...@rsa.com>
> wrote:
>
> Note that depending on the number of iterations, the query plan for the
> dataframe can become long and this can cause slowdowns (or even crashes).
> A possible solution would be to checkpoint (or simply save and reload the
> dataframe) every once in a while. When reloading from disk, the newly
> loaded dataframe's lineage is just the disk...
>
> Thanks,
>   Assaf.
>
>
> -Original Message-
> From: saatvikshah1994 [mailto:saatvikshah1...@gmail.com]
> Sent: Tuesday, June 20, 2017 2:22 AM
> To: user@spark.apache.org
> Subject: Merging multiple Pandas dataframes
>
> Hi,
>
> I am iteratively receiving files which can only be opened as Pandas
> dataframes. I convert the first such file to a Spark dataframe using the
> 'createDataFrame' utility function. From the next file onward, I convert it
> and union it into the first Spark dataframe (the schema always stays the
> same). After each union, I persist it in memory (MEMORY_AND_DISK_ONLY level).
> After I have converted all such files into a single Spark dataframe, I
> coalesce it, following some tips from this Stack Overflow post:
> https://stackoverflow.com/questions/39381183/managing-spark-partitions-after-dataframe-unions.
>
> Any suggestions for optimizing this process further?
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Merging-multiple-Pandas-dataframes-tp28770.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
>
> --
>
> *Saatvik Shah,*
>
> *1st  Year,*
>
> *Masters in the School of Computer Science,*
>
> *Carnegie Mellon University*
>
> *https://saatvikshah1994.github.io/ <https://saatvikshah1994.github.io/>*
>



-- 
*Saatvik Shah,*
*1st  Year,*
*Masters in the School of Computer Science,*
*Carnegie Mellon University*

*https://saatvikshah1994.github.io/ <https://saatvikshah1994.github.io/>*


Re: Merging multiple Pandas dataframes

2017-06-20 Thread Saatvik Shah
Hi Assaf,

Thanks for the suggestion on checkpointing - I'll need to read up more on
that.

My current implementation seems to be crashing with a GC memory limit
exceeded error if I'm keeping multiple persist calls for a large number of
files.

Thus, I was also thinking about the constant calls to persist. Since all my
operations are Spark transformations (a union of a large number of Spark
DataFrames built from Pandas dataframes), this entire process of building a
large Spark dataframe is essentially one huge transformation. Is it necessary
to call persist between unions? Shouldn't I instead wait for all the unions to
complete and call persist once at the end?
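For concreteness, a minimal sketch of the periodic save-and-reload pattern under
discussion (pandas_dfs, the batch size of 5, and the parquet paths are all
illustrative):

df = None
for i, pdf in enumerate(pandas_dfs):
    batch = spark.createDataFrame(pdf)
    df = batch if df is None else df.union(batch)
    if (i + 1) % 5 == 0:
        # Write out and reload to truncate the growing lineage; a fresh
        # path per checkpoint avoids overwriting the files being read.
        path = "/tmp/merge_checkpoint_{}".format(i + 1)
        df.write.mode("overwrite").parquet(path)
        df = spark.read.parquet(path)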




On Tue, Jun 20, 2017 at 2:52 AM, Mendelson, Assaf <assaf.mendel...@rsa.com>
wrote:

> Note that depending on the number of iterations, the query plan for the
> dataframe can become long and this can cause slowdowns (or even crashes).
> A possible solution would be to checkpoint (or simply save and reload the
> dataframe) every once in a while. When reloading from disk, the newly
> loaded dataframe's lineage is just the disk...
>
> Thanks,
>   Assaf.
>
> -Original Message-
> From: saatvikshah1994 [mailto:saatvikshah1...@gmail.com]
> Sent: Tuesday, June 20, 2017 2:22 AM
> To: user@spark.apache.org
> Subject: Merging multiple Pandas dataframes
>
> Hi,
>
> I am iteratively receiving files which can only be opened as Pandas
> dataframes. I convert the first such file to a Spark dataframe using the
> 'createDataFrame' utility function. From the next file onward, I convert it
> and union it into the first Spark dataframe (the schema always stays the
> same). After each union, I persist it in memory (MEMORY_AND_DISK_ONLY level).
> After I have converted all such files into a single Spark dataframe, I
> coalesce it, following some tips from this Stack Overflow post:
> https://stackoverflow.com/questions/39381183/managing-spark-partitions-after-dataframe-unions.
>
> Any suggestions for optimizing this process further?
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Merging-multiple-Pandas-dataframes-tp28770.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -----
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
*Saatvik Shah,*
*1st  Year,*
*Masters in the School of Computer Science,*
*Carnegie Mellon University*

*https://saatvikshah1994.github.io/ <https://saatvikshah1994.github.io/>*


Re: Best alternative for Category Type in Spark Dataframe

2017-06-17 Thread Saatvik Shah
Thanks guys,

You've given me a number of options to work with.

The thing is that I'm working in a production environment, where it might be
necessary to ensure that no one erroneously inserts new records into those
specific columns which should be of the Category data type. The best
alternative there would be a Category-like dataframe column datatype, without
the additional overhead of running a transformer. Is that possible?
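A minimal validation sketch in the meantime, checking values rather than
defining a type (df, the EMOTION column, and the allowed set come from the
earlier example and are illustrative):

from pyspark.sql import functions as F

ALLOWED = ["HAPPY", "SAD", "ANGRY", "NEUTRAL", "NA"]

# Fail fast if any row carries a value outside the allowed set.
bad = df.filter(~F.col("EMOTION").isin(ALLOWED)).count()
if bad > 0:
    raise ValueError("{} rows have an EMOTION outside {}".format(bad, ALLOWED))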

Thanks and Regards,
Saatvik

On Sat, Jun 17, 2017 at 11:15 PM, Pralabh Kumar <pralabhku...@gmail.com>
wrote:

> make sense :)
>
> On Sun, Jun 18, 2017 at 8:38 AM, 颜发才(Yan Facai) <facai@gmail.com>
> wrote:
>
>> Yes, perhaps we could use SQLTransformer as well.
>>
>> http://spark.apache.org/docs/latest/ml-features.html#sqltransformer
>>
>> On Sun, Jun 18, 2017 at 10:47 AM, Pralabh Kumar <pralabhku...@gmail.com>
>> wrote:
>>
>>> Hi Yan
>>>
>>> Yes, SQL is a good option, but if we have to create an ML Pipeline, then
>>> having Transformers and setting them into the pipeline stages would be the
>>> better option.
>>>
>>> Regards
>>> Pralabh Kumar
>>>
>>> On Sun, Jun 18, 2017 at 4:23 AM, 颜发才(Yan Facai) <facai@gmail.com>
>>> wrote:
>>>
>>>> To filter data, how about using sql?
>>>>
>>>> df.createOrReplaceTempView("df")
>>>> val sqlDF = spark.sql("SELECT * FROM df WHERE EMOTION IN ('HAPPY','SAD','ANGRY','NEUTRAL','NA')")
>>>>
>>>> https://spark.apache.org/docs/latest/sql-programming-guide.html#sql
>>>>
>>>>
>>>>
>>>> On Fri, Jun 16, 2017 at 11:28 PM, Pralabh Kumar <pralabhku...@gmail.com
>>>> > wrote:
>>>>
>>>>> Hi Saatvik
>>>>>
>>>>> You can write your own transformer to make sure that the column contains
>>>>> only the values you provided, and filter out rows which don't follow the
>>>>> same.
>>>>>
>>>>> Something like this
>>>>>
>>>>>
>>>>> import org.apache.spark.annotation.DeveloperApi
>>>>> import org.apache.spark.ml.Transformer
>>>>> import org.apache.spark.ml.param.ParamMap
>>>>> import org.apache.spark.sql.{DataFrame, Dataset}
>>>>> import org.apache.spark.sql.types.StructType
>>>>>
>>>>> case class CategoryTransformer(override val uid: String) extends Transformer {
>>>>>   // Spark 2.x Transformer.transform takes a Dataset[_]; this keeps only
>>>>>   // the rows whose col1 value is in the allowed set.
>>>>>   override def transform(inputData: Dataset[_]): DataFrame = {
>>>>>     inputData.select("col1").filter("col1 in ('happy')")
>>>>>   }
>>>>>   override def copy(extra: ParamMap): Transformer = ???
>>>>>   @DeveloperApi
>>>>>   override def transformSchema(schema: StructType): StructType = {
>>>>>     schema
>>>>>   }
>>>>> }
>>>>>
>>>>>
>>>>> Usage
>>>>>
>>>>> val data = sc.parallelize(List("abce","happy")).toDF("col1")
>>>>> val trans = new CategoryTransformer("1")
>>>>> data.show()
>>>>> trans.transform(data).show()
>>>>>
>>>>>
>>>>> This transformer will make sure you only ever have the values you
>>>>> provided in col1.
>>>>>
>>>>>
>>>>> Regards
>>>>> Pralabh Kumar
>>>>>
>>>>> On Fri, Jun 16, 2017 at 8:10 PM, Saatvik Shah <
>>>>> saatvikshah1...@gmail.com> wrote:
>>>>>
>>>>>> Hi Pralabh,
>>>>>>
>>>>>> I want the ability to create a column such that its values are
>>>>>> restricted to a specific set of predefined values.
>>>>>> For example, suppose I have a column called EMOTION: I want to ensure
>>>>>> each row value is one of HAPPY,SAD,ANGRY,NEUTRAL,NA.
>>>>>>
>>>>>> Thanks and Regards,
>>>>>> Saatvik Shah
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 16, 2017 at 10:30 AM, Pralabh Kumar <
>>>>>> pralabhku...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Saatvik,
>>>>>>>
>>>>>>> Can you please provide an example of what exactly you want?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 16-Jun-2017 7:40 PM, "Saatvik Shah" <saatvikshah1...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Yan,
>>>>>>>>
>>>>>>>> Basically the reason I was looking for the categorical datatype 

Re: Best alternative for Category Type in Spark Dataframe

2017-06-16 Thread Saatvik Shah
Hi Pralabh,

I want the ability to create a column such that its values are restricted to
a specific set of predefined values.
For example, suppose I have a column called EMOTION: I want to ensure each
row value is one of HAPPY,SAD,ANGRY,NEUTRAL,NA.

Thanks and Regards,
Saatvik Shah

On Fri, Jun 16, 2017 at 10:30 AM, Pralabh Kumar <pralabhku...@gmail.com>
wrote:

> Hi Saatvik,
>
> Can you please provide an example of what exactly you want?
>
>
>
> On 16-Jun-2017 7:40 PM, "Saatvik Shah" <saatvikshah1...@gmail.com> wrote:
>
>> Hi Yan,
>>
>> Basically the reason I was looking for the categorical datatype is as
>> given here
>> <https://pandas.pydata.org/pandas-docs/stable/categorical.html>: ability
>> to fix column values to specific categories. Is it possible to create a
>> user defined data type which could do so?
>>
>> Thanks and Regards,
>> Saatvik Shah
>>
>> On Fri, Jun 16, 2017 at 1:42 AM, 颜发才(Yan Facai) <facai@gmail.com>
>> wrote:
>>
>>> You can use some Transformers to handle categorical data,
>>> For example,
>>> StringIndexer encodes a string column of labels to a column of label
>>> indices:
>>> http://spark.apache.org/docs/latest/ml-features.html#stringindexer
>>>
>>>
>>> On Thu, Jun 15, 2017 at 10:19 PM, saatvikshah1994 <
>>> saatvikshah1...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> I'm trying to convert a Pandas -> Spark dataframe. One of the columns I
>>>> have
>>>> is of the Category type in Pandas. But there does not seem to be
>>>> support for
>>>> this same type in Spark. What is the best alternative?
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context: http://apache-spark-user-list.
>>>> 1001560.n3.nabble.com/Best-alternative-for-Category-Type-in-
>>>> Spark-Dataframe-tp28764.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> -
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>
>>>>
>>>
>>
>>
>> --
>> *Saatvik Shah,*
>> *1st  Year,*
>> *Masters in the School of Computer Science,*
>> *Carnegie Mellon University*
>>
>> *https://saatvikshah1994.github.io/ <https://saatvikshah1994.github.io/>*
>>
>


-- 
*Saatvik Shah,*
*1st  Year,*
*Masters in the School of Computer Science,*
*Carnegie Mellon University*

*https://saatvikshah1994.github.io/ <https://saatvikshah1994.github.io/>*


Re: Best alternative for Category Type in Spark Dataframe

2017-06-16 Thread Saatvik Shah
Hi Yan,

Basically, the reason I was looking for the categorical datatype is as given
here <https://pandas.pydata.org/pandas-docs/stable/categorical.html>: the
ability to fix column values to specific categories. Is it possible to create
a user-defined data type which could do so?
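For reference, a minimal PySpark sketch of the StringIndexer approach (column
names are illustrative; with the default handleInvalid of "error", transform
fails on labels not seen during fit, though this does not define a restricted
column type as such):

from pyspark.ml.feature import StringIndexer

# Illustrative column names; df is the DataFrame holding the EMOTION column.
indexer = StringIndexer(inputCol="EMOTION", outputCol="EMOTION_IDX")
model = indexer.fit(df)
indexed = model.transform(df)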

Thanks and Regards,
Saatvik Shah

On Fri, Jun 16, 2017 at 1:42 AM, 颜发才(Yan Facai) <facai@gmail.com> wrote:

> You can use some Transformers to handle categorical data,
> For example,
> StringIndexer encodes a string column of labels to a column of label
> indices:
> http://spark.apache.org/docs/latest/ml-features.html#stringindexer
>
>
> On Thu, Jun 15, 2017 at 10:19 PM, saatvikshah1994 <
> saatvikshah1...@gmail.com> wrote:
>
>> Hi,
>> I'm trying to convert a Pandas -> Spark dataframe. One of the columns I
>> have
>> is of the Category type in Pandas. But there does not seem to be support
>> for
>> this same type in Spark. What is the best alternative?
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Best-alternative-for-Category-Type-in-
>> Spark-Dataframe-tp28764.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -----
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


-- 
*Saatvik Shah,*
*1st  Year,*
*Masters in the School of Computer Science,*
*Carnegie Mellon University*

*https://saatvikshah1994.github.io/ <https://saatvikshah1994.github.io/>*