Mapping words to vector sparkml CountVectorizerModel

2017-12-18 Thread Sandeep Nemuri
Hi All,

I've used CountVectorizerModel in Spark ML and got the tf-idf of the words.

The output column of the DataFrame looks like:

*(63709,[0,1,2,3,6,7,8,10,11,13],[0.6095235999680518,0.9946971867717818,0.5151611294911758,0.4371112749198506,3.4968901993588046,0.06806241719930584,1.1156025996012633,3.0425756717399217,0.3760235829400124])*

I want to get the top n words that are mapped to this ranking.

Any pointers on how to achieve this?
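
One way to do this is to look each index of the sparse vector up in the
model's vocabulary and sort by weight. A rough sketch (Spark 2.x ml API
assumed; cvModel, tfidfDF and the "features" column name are placeholders
for your own objects):

import org.apache.spark.ml.linalg.Vector

// vocabulary(i) is the word behind index i of the tf-idf vector
val vocab = cvModel.vocabulary

// pick the n highest-weighted (word, tf-idf) pairs out of one row's vector
def topNWords(v: Vector, n: Int): Array[(String, Double)] = {
  val sv = v.toSparse
  sv.indices.zip(sv.values)          // (vocabulary index, tf-idf weight)
    .sortBy { case (_, w) => -w }    // highest weight first
    .take(n)
    .map { case (i, w) => (vocab(i), w) }
}

val top10 = tfidfDF.rdd.map(row => topNWords(row.getAs[Vector]("features"), 10))
top10.take(5).foreach(r => println(r.mkString(", ")))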

-- 
*  Regards*
*  Sandeep Nemuri*


Re: SparkSession via HS2 - Error -spark.yarn.jars not read

2017-07-05 Thread Sandeep Nemuri
The Spark Thrift Server (STS) reads spark-thrift-sparkconf.conf. Can you check
whether spark.yarn.jars exists in that file?
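
If it is not picked up from there, one possible workaround (an untested sketch;
the HDFS path is the one from this thread, and spark.yarn.jars usually wants
jar paths or a glob rather than a bare directory) is to set the property
explicitly when the UDTF builds its SparkSession:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkHiveUDTF")        // name taken from the stack trace below
  .master("yarn")
  // assumed location of the Spark 2 jars on HDFS, with a trailing glob
  .config("spark.yarn.jars",
    "hdfs://ambari03.fuzzyl.com:8020/hdp/apps/2.6.1.0-129/spark2/*")
  .getOrCreate()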



On Wed, Jul 5, 2017 at 2:01 PM, Sudha KS <sudha...@fuzzylogix.com> wrote:

> The property “spark.yarn.jars” is available via
> /usr/hdp/current/spark2-client/conf/spark-default.conf
>
>
>
> spark.yarn.jars hdfs://ambari03.fuzzyl.com:8020/hdp/apps/2.6.1.0-129/spark2
>
>
>
>
>
> Is there any other way to set/read/pass this property “spark.yarn.jars”?
>
>
>
> *From:* Sudha KS [mailto:sudha...@fuzzylogix.com]
> *Sent:* Wednesday, July 5, 2017 1:51 PM
> *To:* user@spark.apache.org
> *Subject:* SparkSession via HS2 - Error -spark.yarn.jars not read
>
>
>
> Why is the “spark.yarn.jars” property not read in this HDP 2.6, Spark 2.1.1
> cluster:
>
> 0: jdbc:hive2://localhost:1/db> set spark.yarn.jars;
>
> +-------------------------------------------------------------------------------+--+
> | set                                                                             |
> +-------------------------------------------------------------------------------+--+
> | spark.yarn.jars=hdfs://ambari03.fuzzyl.com:8020/hdp/apps/2.6.1.0-129/spark2     |
> +-------------------------------------------------------------------------------+--+
>
> 1 row selected (0.101 seconds)
>
> 0: jdbc:hive2://localhost:1/db>
>
>
>
>
>
>
>
> Error during launch of a SparkSession via HS2:
>
> Caused by: java.lang.IllegalStateException: Library directory
> '/hadoop/yarn/local/usercache/hive/appcache/application_1499235958765_0042/container_e04_1499235958765_0042_01_05/assembly/target/scala-2.11/jars'
> does not exist; make sure Spark is built.
>
>     at org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:260)
>     at org.apache.spark.launcher.CommandBuilderUtils.findJarsDir(CommandBuilderUtils.java:380)
>     at org.apache.spark.launcher.YarnCommandBuilderUtils$.findJarsDir(YarnCommandBuilderUtils.scala:38)
>     at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:570)
>     at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:895)
>     at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:171)
>     at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>     at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156)
>     at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
>     at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2320)
>     at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
>     at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
>     at scala.Option.getOrElse(Option.scala:121)
>     at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
>     at SparkHiveUDTF.sparkJob(SparkHiveUDTF.java:97)
>     at SparkHiveUDTF.process(SparkHiveUDTF.java:78)
>     at org.apache.hadoop.hive.ql.exec.UDTFOperator.process(UDTFOperator.java:109)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:841)
>     at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:841)
>     at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:133)
>     at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:170)
>     at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:555)
>     ... 18 more
>



-- 
*  Regards*
*  Sandeep Nemuri*


Re: What is the easiest way for an application to Query parquet data on HDFS?

2017-06-04 Thread Sandeep Nemuri
Well, if you are using the Hortonworks distribution, there is Livy2, which is
compatible with Spark2 and Scala 2.11.

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.0/bk_command-line-installation/content/install_configure_livy2.html


On Sun, Jun 4, 2017 at 1:55 PM, kant kodali <kanth...@gmail.com> wrote:

> Hi,
>
> Thanks for this but here is what the documentation says:
>
> "To run the Livy server, you will also need an Apache Spark installation.
> You can get Spark releases at https://spark.apache.org/downloads.html.
> Livy requires at least Spark 1.4 and currently only supports Scala 2.10
> builds of Spark. To run Livy with local sessions, first export these
> variables:"
>
> I am using Spark 2.1.1 and Scala 2.11.8, and I would like to use the DataFrame
> and Dataset APIs, so it sounds like this is not an option for me?
>
> Thanks!
>
> On Sun, Jun 4, 2017 at 12:23 AM, Sandeep Nemuri <nhsande...@gmail.com>
> wrote:
>
>> Check out http://livy.io/
>>
>>
>> On Sun, Jun 4, 2017 at 11:59 AM, kant kodali <kanth...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I am wondering what is the easiest way for a microservice to query data
>>> on HDFS? By easiest I mean using a minimal number of tools.
>>>
>>> Currently I use Spark Structured Streaming to do some real-time
>>> aggregations and store the results in HDFS. But now, I want my microservice
>>> app to be able to query and access data on HDFS. It looks like SparkSession
>>> can only be accessed through the CLI but not through a JDBC-like API or
>>> whatever.
>>> Any suggestions?
>>>
>>> Thanks!
>>>
>>
>>
>>
>> --
>> *  Regards*
>> *  Sandeep Nemuri*
>>
>
>


-- 
*  Regards*
*  Sandeep Nemuri*


Re: What is the easiest way for an application to Query parquet data on HDFS?

2017-06-04 Thread Sandeep Nemuri
Check out http://livy.io/


On Sun, Jun 4, 2017 at 11:59 AM, kant kodali <kanth...@gmail.com> wrote:

> Hi All,
>
> I am wondering what is the easiest way for a microservice to query data
> on HDFS? By easiest I mean using a minimal number of tools.
>
> Currently I use Spark Structured Streaming to do some real-time
> aggregations and store the results in HDFS. But now, I want my microservice
> app to be able to query and access data on HDFS. It looks like SparkSession
> can only be accessed through the CLI but not through a JDBC-like API or
> whatever.
> Any suggestions?
>
> Thanks!
>



-- 
*  Regards*
*  Sandeep Nemuri*


Re: spark-submit config via file

2017-03-27 Thread Sandeep Nemuri
t$$distribute$1(Client.scala:480)
>>
>>     at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:552)
>>     at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:881)
>>     at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:170)
>>     at org.apache.spark.deploy.yarn.Client.run(Client.scala:1218)
>>     at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1277)
>>     at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>     at java.lang.reflect.Method.invoke(Method.java:498)
>>     at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:745)
>>     at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>>     at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>> 17/03/24 11:36:27 INFO MetricsSystemImpl: Stopping azure-file-system
>> metrics system...
>>
>> Anyone know if this is even possible?
>>
>>
>> Thanks...
>>
>> Roy
>>
>
>


-- 
*  Regards*
*  Sandeep Nemuri*


Re: Spark Streaming Job Keeps growing memory over time

2016-08-09 Thread Sandeep Nemuri
Hi Aashish,

Do you have checkpointing enabled? If not, can you try enabling
checkpointing and observing the memory pattern?
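
A minimal sketch of enabling it when the StreamingContext is created (the app
name, 1-minute batch interval and checkpoint directory below are placeholders;
point the checkpoint at a reliable store such as HDFS):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("kafka-to-mongo")
val ssc  = new StreamingContext(conf, Seconds(60))
// enable checkpointing; metadata (and state, if you use stateful ops) goes here
ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")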

Thanks,
Sandeep

On Tue, Aug 9, 2016 at 4:25 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi Aashish,
>
> You are running in standalone mode with one node
>
> As I read you start master and 5 workers pop up from
> SPARK_WORKER_INSTANCES=5. I gather you use start-slaves.sh?
>
> Now that is the number of workers; with the low memory on them, port 8080
> should show practically no memory used (idle). Also, every worker has been
> allocated 1 core (SPARK_WORKER_CORE=1).
>
> Now it all depends on how you start your spark-submit job and what parameters
> you pass to it.
>
> ${SPARK_HOME}/bin/spark-submit \
> --driver-memory 1G \
> --num-executors 2 \
> --executor-cores 1 \
> --executor-memory 1G \
> --master spark://:7077 \
>
> What are your parameters here? From my experience, standalone mode has a mind
> of its own and it does not follow what you have asked.
>
> If you increase the number of cores for workers, you may reduce the memory
> issue, because effectively multiple tasks can run on subsets of your data.
>
> HTH
>
> P.S. I don't use SPARK_MASTER_OPTS
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn *
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 9 August 2016 at 11:21, aasish.kumar <aasish.ku...@avekshaa.com> wrote:
>
>> Hi,
>>
>> I am running Spark v1.6.1 on a single machine in standalone mode, having
>> 64 GB RAM and 16 cores.
>>
>> I have created five worker instances to get five executors, since in
>> standalone mode there cannot be more than one executor per worker node.
>>
>> *Configuration*:
>>
>> SPARK_WORKER_INSTANCES 5
>> SPARK_WORKER_CORE 1
>> SPARK_MASTER_OPTS "-Dspark.deploy.default.Cores=5"
>>
>> all other configurations are default in spark_env.sh
>>
>> I am running a Spark Streaming direct Kafka job at an interval of 1 min,
>> which takes data from Kafka and, after some aggregation, writes the data to
>> Mongo.
>>
>> *Problems:*
>>
>> > When I start the master and slaves, it starts one master process and five
>> > worker processes, each consuming only about 212 MB of RAM. When I submit
>> > the job, it again creates 5 executor processes and 1 job process, and the
>> > memory usage grows to 8 GB in total and keeps growing over time (slowly),
>> > even when there is no data to process.
>>
>> I am also unpersisting the cached RDD at the end and have set
>> spark.cleaner.ttl to 600, but the memory is still growing.
>>
>> > One more thing: I have seen that SPARK-1706 was merged, so why am I still
>> > unable to create multiple executors within a worker? Also, in the
>> > spark-env.sh file, setting any configuration related to executors seems to
>> > apply under YARN mode only.
>>
>> I have also tried running the example program, but I see the same problem.
>>
>> Any help would be greatly appreciated,
>>
>> Thanks
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Job-Keeps-growing-memory-over-time-tp27498.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>>
>


-- 
*  Regards*
*  Sandeep Nemuri*


Re: Stop Spark Streaming Jobs

2016-08-04 Thread Sandeep Nemuri
Also set spark.streaming.stopGracefullyOnShutdown to true
If true, Spark shuts down the StreamingContext gracefully on JVM shutdown
rather than immediately.

http://spark.apache.org/docs/latest/configuration.html#spark-streaming
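
It is a normal Spark property, so it can be set on the SparkConf (sketch below;
the app name and batch interval are placeholders) or passed to spark-submit
with --conf spark.streaming.stopGracefullyOnShutdown=true:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("my-streaming-job")
  // let a SIGTERM to the driver finish in-flight work before shutting down
  .set("spark.streaming.stopGracefullyOnShutdown", "true")
val ssc = new StreamingContext(conf, Seconds(60))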











On Thu, Aug 4, 2016 at 12:31 PM, Sandeep Nemuri <nhsande...@gmail.com>
wrote:

> StreamingContext.stop(...) if using scala
> JavaStreamingContext.stop(...) if using Java
>
>
> On Wed, Aug 3, 2016 at 9:14 PM, Tony Lane <tonylane@gmail.com> wrote:
>
>> SparkSession exposes stop() method
>>
>> On Wed, Aug 3, 2016 at 8:53 AM, Pradeep <pradeep.mi...@mail.com> wrote:
>>
>>> Thanks Park. I am doing the same. Was trying to understand if there are
>>> other ways.
>>>
>>> Thanks,
>>> Pradeep
>>>
>>> > On Aug 2, 2016, at 10:25 PM, Park Kyeong Hee <kh1979.p...@samsung.com>
>>> wrote:
>>> >
>>> > So sorry. Your name was Pradeep !!
>>> >
>>> > -Original Message-
>>> > From: Park Kyeong Hee [mailto:kh1979.p...@samsung.com]
>>> > Sent: Wednesday, August 03, 2016 11:24 AM
>>> > To: 'Pradeep'; 'user@spark.apache.org'
>>> > Subject: RE: Stop Spark Streaming Jobs
>>> >
>>> > Hi. Paradeep
>>> >
>>> >
>>> > Did you mean, how to kill the job?
>>> > If yes, you should kill the driver and follow next.
>>> >
>>> > on yarn-client
>>> > 1. find pid - "ps -es | grep "
>>> > 2. kill it - "kill -9 "
>>> > 3. check executors were down - "yarn application -list"
>>> >
>>> > on yarn-cluster
>>> > 1. find driver's application ID - "yarn application -list"
>>> > 2. stop it - "yarn application -kill "
>>> > 3. check driver and executors were down - "yarn application -list"
>>> >
>>> >
>>> > Thanks.
>>> >
>>> > -Original Message-
>>> > From: Pradeep [mailto:pradeep.mi...@mail.com]
>>> > Sent: Wednesday, August 03, 2016 10:48 AM
>>> > To: user@spark.apache.org
>>> > Subject: Stop Spark Streaming Jobs
>>> >
>>> > Hi All,
>>> >
>>> > My streaming job reads data from Kafka. The job is triggered and pushed to
>>> > background with nohup.
>>> >
>>> > What are the recommended ways to stop job either on yarn-client or cluster
>>> > mode.
>>> >
>>> > Thanks,
>>> > Pradeep
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>>
>>>
>>>
>>>
>>
>
>
> --
> *  Regards*
> *  Sandeep Nemuri*
>



-- 
*  Regards*
*  Sandeep Nemuri*


Re: Stop Spark Streaming Jobs

2016-08-04 Thread Sandeep Nemuri
StreamingContext.stop(...) if using scala
JavaStreamingContext.stop(...) if using Java
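
For example, to stop gracefully from inside the application (a sketch; ssc is
the job's running StreamingContext):

// waits for queued/received data to be processed, then stops the SparkContext too
ssc.stop(stopSparkContext = true, stopGracefully = true)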


On Wed, Aug 3, 2016 at 9:14 PM, Tony Lane <tonylane@gmail.com> wrote:

> SparkSession exposes stop() method
>
> On Wed, Aug 3, 2016 at 8:53 AM, Pradeep <pradeep.mi...@mail.com> wrote:
>
>> Thanks Park. I am doing the same. Was trying to understand if there are
>> other ways.
>>
>> Thanks,
>> Pradeep
>>
>> > On Aug 2, 2016, at 10:25 PM, Park Kyeong Hee <kh1979.p...@samsung.com>
>> wrote:
>> >
>> > So sorry. Your name was Pradeep !!
>> >
>> > -Original Message-
>> > From: Park Kyeong Hee [mailto:kh1979.p...@samsung.com]
>> > Sent: Wednesday, August 03, 2016 11:24 AM
>> > To: 'Pradeep'; 'user@spark.apache.org'
>> > Subject: RE: Stop Spark Streaming Jobs
>> >
>> > Hi. Paradeep
>> >
>> >
>> > Did you mean, how to kill the job?
>> > If yes, you should kill the driver and follow next.
>> >
>> > on yarn-client
>> > 1. find pid - "ps -es | grep "
>> > 2. kill it - "kill -9 "
>> > 3. check executors were down - "yarn application -list"
>> >
>> > on yarn-cluster
>> > 1. find driver's application ID - "yarn application -list"
>> > 2. stop it - "yarn application -kill "
>> > 3. check driver and executors were down - "yarn application -list"
>> >
>> >
>> > Thanks.
>> >
>> > -Original Message-
>> > From: Pradeep [mailto:pradeep.mi...@mail.com]
>> > Sent: Wednesday, August 03, 2016 10:48 AM
>> > To: user@spark.apache.org
>> > Subject: Stop Spark Streaming Jobs
>> >
>> > Hi All,
>> >
>> > My streaming job reads data from Kafka. The job is triggered and pushed to
>> > background with nohup.
>> >
>> > What are the recommended ways to stop job either on yarn-client or cluster
>> > mode.
>> >
>> > Thanks,
>> > Pradeep
>> >
>> >
>> >
>> >
>> >
>> >
>>
>>
>>
>>
>


-- 
*  Regards*
*  Sandeep Nemuri*


Re: data frame or RDD for machine learning

2016-06-09 Thread Sandeep Nemuri
Please refer to: http://spark.apache.org/docs/latest/mllib-guide.html
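
As a small illustration of the DataFrame-based (spark.ml) side described in
that guide, a sketch only, where trainingDF is assumed to be a DataFrame with
"label" and "features" columns:

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)

val model       = lr.fit(trainingDF)        // i.e. Model = algorithm(df)
val predictions = model.transform(trainingDF)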

~Sandeep

On Thursday 9 June 2016, Jacek Laskowski  wrote:

> Hi,
>
> Use the DataFrame-based API (aka spark.ml) first, and if your ML algorithm
> doesn't support it, switch to the RDD-based API (spark.mllib). What algorithm
> are you going to use?
>
> Jacek
> On 9 Jun 2016 9:12 a.m., "pseudo oduesp" wrote:
>
>> Hi,
>> after Spark 1.3 we have DataFrames (thank goodness), instead of RDDs:
>>
>> in machine learning algorithms, should we give it an RDD or a DataFrame?
>>
>> I mean, when I build a model:
>>
>>
>>    Model = algorithm(rdd)
>> or
>>    Model = algorithm(df)
>>
>>
>>
>> If you have an example with a DataFrame, I'd prefer to work with that.
>>
>> thanks .
>>
>>

-- 
Sent from iPhone


Re: yarn-cluster mode error

2016-05-17 Thread Sandeep Nemuri
Can you post the complete stack trace?

On Tue, May 17, 2016 at 7:00 PM, <spark@yahoo.com.invalid> wrote:

> Hi,
>
> I am getting the error below while running an application in yarn-cluster mode.
>
> *ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM*
>
> Can anyone suggest why I am getting this error message?
>
> Thanks
> Raj
>
>
>
>
>



-- 
*  Regards*
*  Sandeep Nemuri*


Re: parquet table in spark-sql

2016-05-03 Thread Sandeep Nemuri
We don't need any delimiters for the Parquet file format.
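
For example (a sketch only; the table and column names are made up, and
sqlContext stands for the app's HiveContext, or spark.sql on Spark 2.x), the
DDL just needs the columns and the storage format:

sqlContext.sql(
  """CREATE TABLE calls (
    |  caller          STRING,
    |  total_duration  BIGINT)
    |STORED AS PARQUET""".stripMargin)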


On Tue, May 3, 2016 at 5:31 AM, Varadharajan Mukundan <srinath...@gmail.com>
wrote:

> Hi,
>
> Yes, it is not needed. Delimiters are needed only for text files.
>
> On Tue, May 3, 2016 at 12:49 PM, 喜之郎 <251922...@qq.com> wrote:
>
>> Hi, I want to ask a question about Parquet tables in spark-sql.
>>
>> I think that Parquet has schema information in its own files,
>> so you don't need to define the row separator and column separator in the
>> create-table DDL, like this:
>>
>> total_duration  BigInt)
>> ROW FORMAT DELIMITED
>>   FIELDS TERMINATED BY ','
>>   LINES TERMINATED BY '\n'
>>
>> Can anyone give me an answer? Thanks.
>>
>>
>>
>
>
> --
> Thanks,
> M. Varadharajan
>
> 
>
> "Experience is what you get when you didn't get what you wanted"
>-By Prof. Randy Pausch in "The Last Lecture"
>
> My Journal :- http://varadharajan.in
>



-- 
*  Regards*
*  Sandeep Nemuri*