Re: Isotonic Regression, run method overloaded Error

2016-07-10 Thread Yanbo Liang
Hi Swaroop,

Would you mind sharing your code so that others can help you figure out
what caused this error?
I can run the isotonic regression examples well.
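
For reference, the RDD-based run() expects (label, feature, weight) tuples, so
only a single feature column can be used; MLlib's isotonic regression is
univariate. A minimal sketch (the input RDD `rows` of (label, featureVector)
pairs is hypothetical):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.IsotonicRegression
import org.apache.spark.rdd.RDD

// rows: RDD[(Double, Vector)] -- the label plus the 9 features. Pick one
// feature and use a constant weight of 1.0, since run() only accepts
// (label, feature, weight) triples.
def toIsotonicInput(rows: RDD[(Double, Vector)]): RDD[(Double, Double, Double)] =
  rows.map { case (label, features) => (label, features(0), 1.0) }

// val model = new IsotonicRegression().setIsotonic(true).run(toIsotonicInput(rows))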

Thanks
Yanbo

2016-07-08 13:38 GMT-07:00 dsp :

> Hi I am trying to perform Isotonic Regression on a data set with 9 features
> and a label.
> When I run the algorithm similar to the way mentioned on MLlib page, I get
> the error saying
>
> /*error:* overloaded method value run with alternatives:
> (input: org.apache.spark.api.java.JavaRDD[(java.lang.Double,
> java.lang.Double,
>
> java.lang.Double)])org.apache.spark.mllib.regression.IsotonicRegressionModel
> 
>   (input: org.apache.spark.rdd.RDD[(scala.Double, scala.Double,
> scala.Double)])org.apache.spark.mllib.regression.IsotonicRegressionModel
>  cannot be applied to (org.apache.spark.rdd.RDD[(scala.Double,
> scala.Double,
> scala.Double, scala.Double, scala.Double, scala.Double, scala.Double,
> scala.Double, scala.Double, scala.Double, scala.Double, scala.Double,
> scala.Double)])
>  val model = new
> IsotonicRegression().setIsotonic(true).run(training)/
>
> For the way given in the sample code, it looks like it can be done only for a
> dataset with a single feature, because the run() method accepts only tuples of
> three values, of which one is the label and one is a default (weight) value,
> leaving room for only one feature.
> So, how can this be done for multiple variables?
>
> Regards,
> Swaroop
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Isotonic-Regression-run-method-overloaded-Error-tp27313.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: mllib based on dataset or dataframe

2016-07-10 Thread Yanbo Liang
DataFrame is a kind of special case of Dataset, so they mean the same thing.
Actually the ML pipeline API will accept Dataset[_] instead of DataFrame in
Spark 2.0.
More accurately, we can say that MLlib will focus on the Dataset-based API for
further development.
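
In Spark 2.0's Scala API the relationship is literally a type alias, which is
why the two names are interchangeable; a small illustration:

import org.apache.spark.sql.{DataFrame, Dataset, Row}

// In Spark 2.0, DataFrame is just an alias defined in the
// org.apache.spark.sql package object:
//   type DataFrame = Dataset[Row]
// so a function written against DataFrame accepts exactly a Dataset[Row]:
def describe(df: DataFrame): Unit = df.printSchema()
// describe(someDataset.toDF())   // any Dataset can be viewed as a DataFrame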

Thanks
Yanbo

2016-07-10 20:35 GMT-07:00 jinhong lu :

> Hi,
> Since Dataset will be the major API in Spark 2.0, why will MLlib be
> DataFrame-based, with 'future development will focus on the DataFrame-based
> API'?
>
>    Is there any plan to change MLlib from DataFrame-based to Dataset-based?
>
>
> =
> Thanks,
> lujinhong
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: How to run Zeppelin and Spark Thrift Server Together

2016-07-10 Thread Chanh Le
Hi Ayan,
I tested it and it works fine, but one point still confuses me: what if my (technical) users want to
write some code in Zeppelin to apply things to a Hive table?
Zeppelin and STS can't share a SparkContext, so does that mean we need separate processes?
Is there any way to use the same SparkContext as STS?

Regards,
Chanh


> On Jul 11, 2016, at 10:05 AM, Takeshi Yamamuro  wrote:
> 
> Hi,
> 
> ISTM multiple sparkcontexts are not recommended in spark.
> See: https://issues.apache.org/jira/browse/SPARK-2243 
> 
> 
> // maropu
> 
> 
> On Mon, Jul 11, 2016 at 12:01 PM, ayan guha  > wrote:
> Hi
> 
> Can you try using JDBC interpreter with STS? We are using Zeppelin+STS on 
> YARN for few months now without much issue. 
> 
> On Mon, Jul 11, 2016 at 12:48 PM, Chanh Le  > wrote:
> Hi everybody,
> We are using Spark to query big data and currently we’re using Zeppelin to 
> provide a UI for technical users.
> Now we also need to provide a UI for business users so we use Oracle BI tools 
> and set up a Spark Thrift Server (STS) for it.
> 
> When I run both Zeppelin and STS throw error:
> 
> INFO [2016-07-11 09:40:21,905] ({pool-2-thread-4} 
> SchedulerFactory.java[jobStarted]:131) - Job remoteInterpretJob_1468204821905 
> started by scheduler org.apache.zeppelin.spark.SparkInterpreter835015739
>  INFO [2016-07-11 09:40:21,911] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - Changing view acls to: giaosudau
>  INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - Changing modify acls to: giaosudau
>  INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - SecurityManager: authentication disabled; ui acls disabled; users with view 
> permissions: Set(giaosudau); users with modify permissions: Set(giaosudau)
>  INFO [2016-07-11 09:40:21,918] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - Starting HTTP Server
>  INFO [2016-07-11 09:40:21,919] ({pool-2-thread-4} Server.java[doStart]:272) 
> - jetty-8.y.z-SNAPSHOT
>  INFO [2016-07-11 09:40:21,920] ({pool-2-thread-4} 
> AbstractConnector.java[doStart]:338) - Started SocketConnector@0.0.0.0:54818 
> 
>  INFO [2016-07-11 09:40:21,922] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - Successfully started service 'HTTP class server' on port 54818.
>  INFO [2016-07-11 09:40:22,408] ({pool-2-thread-4} 
> SparkInterpreter.java[createSparkContext]:233) - -- Create new 
> SparkContext local[*] ---
>  WARN [2016-07-11 09:40:22,411] ({pool-2-thread-4} 
> Logging.scala[logWarning]:70) - Another SparkContext is being constructed (or 
> threw an exception in its constructor).  This may indicate an error, since 
> only one SparkContext may be running in this JVM (see SPARK-2243). The other 
> SparkContext was created at:
> 
> Is that mean I need to setup allow multiple context? Because It’s only test 
> in local with local mode If I deploy on mesos cluster what would happened?
> 
> Need you guys suggests some solutions for that. Thanks.
> 
> Chanh
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> 
> 
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro



Problem connecting Zeppelin 0.6 to Spark Thrift Server

2016-07-10 Thread Mich Talebzadeh
Hi,


I can use a JDBC connection to connect from the SQuirreL client to the Spark
Thrift Server, and this works fine.

I have Zeppelin 0.6.0, which works OK with the default Spark interpreter.

I configured the JDBC interpreter to connect to the Spark Thrift Server as follows:

[image: Inline images 1]
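
(In case the inline image does not come through, the interpreter settings are
roughly the following; the property names are the generic Zeppelin JDBC
interpreter ones, and the values mirror the beeline connection shown below.)

default.driver    org.apache.hive.jdbc.HiveDriver
default.url       jdbc:hive2://rhes564:10055
default.user      hduser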

I can use beeline to connect to STS with no problem:

beeline -u jdbc:hive2://rhes564:10055 -n hduser -p

Connecting to jdbc:hive2://rhes564:10055
Connected to: Spark SQL (version 1.6.1)
Driver: Hive JDBC (version 2.0.0)
16/07/11 05:39:19 [main]: WARN jdbc.HiveConnection: Request to set
autoCommit to false; Hive does not support autoCommit=false.
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.0.0 by Apache Hive

Now the problem I have is that when I use Zeppelin to connect to STS with "show
databases", I get the following error in the JDBC log:

 INFO [2016-07-11 05:43:27,132] ({pool-2-thread-4}
SchedulerFactory.java[jobFinished]:137) - Job
remoteInterpretJob_1468212207128 finished by scheduler
org.apache.zeppelin.jdbc.JDBCInterpreter121016079
 INFO [2016-07-11 05:43:54,413] ({pool-2-thread-6}
SchedulerFactory.java[jobStarted]:131) - Job
remoteInterpretJob_1468212234413 started by scheduler
org.apache.zeppelin.jdbc.JDBCInterpreter121016079
 INFO [2016-07-11 05:43:54,413] ({pool-2-thread-6}
JDBCInterpreter.java[interpret]:385) - Run SQL command 'show databases'
 INFO [2016-07-11 05:43:54,414] ({pool-2-thread-6}
JDBCInterpreter.java[interpret]:394) - PropertyKey: default, SQL command:
'show databases'
 INFO [2016-07-11 05:43:54,418] ({pool-2-thread-6}
JDBCInterpreter.java[getConnection]:219) -
ERROR [2016-07-11 05:43:54,418] ({pool-2-thread-6}
JDBCInterpreter.java[executeSql]:366) - Cannot run show databases
java.lang.ClassNotFoundException:
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at
org.apache.zeppelin.jdbc.JDBCInterpreter.getConnection(JDBCInterpreter.java:220)
at
org.apache.zeppelin.jdbc.JDBCInterpreter.getStatement(JDBCInterpreter.java:233)
at
org.apache.zeppelin.jdbc.JDBCInterpreter.executeSql(JDBCInterpreter.java:292)
at
org.apache.zeppelin.jdbc.JDBCInterpreter.interpret(JDBCInterpreter.java:396)
at
org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:94)
at
org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:341)
at org.apache.zeppelin.scheduler.Job.run(Job.java:176)
at
org.apache.zeppelin.scheduler.ParallelScheduler$JobRunner.run(ParallelScheduler.java:162)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


BTW, I made both of the following jars available under
$ZEPPELIN_HOME/interpreter/jdbc:

lrwxrwxrwx  1 hduser hadoop  37 Jul 10 14:35 hive-jdbc-2.0.0.jar ->
/usr/lib/hive/lib/hive-jdbc-2.0.0.jar
lrwxrwxrwx  1 hduser hadoop  36 Jul 10 19:56 hive-cli-2.0.0.jar ->
/usr/lib/hive/lib/hive-cli-2.0.0.jar


Appreciate any feedback


Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: StreamingKmeans Spark doesn't work at all

2016-07-10 Thread Shuai Lin
I would suggest you run the Scala version of the example first, so you can
tell whether it's a problem with the data you provided or a problem with the
Java code.
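
If it helps, the example can be launched directly from a Spark distribution
with run-example; the argument order below follows the usage string at the top
of that example file (the directories and numbers are placeholders):

./bin/run-example mllib.StreamingKMeansExample \
  /path/to/training /path/to/test 5 3 2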

On Mon, Jul 11, 2016 at 2:37 AM, Biplob Biswas 
wrote:

> Hi,
>
> I know i am asking again, but I tried running the same thing on mac as
> well as some answers on the internet suggested it could be an issue with
> the windows environment, but still nothing works.
>
> Can anyone atleast suggest whether its a bug with spark or is it something
> else?
>
> Would be really grateful! Thanks a lot.
>
> Thanks & Regards
> Biplob Biswas
>
> On Thu, Jul 7, 2016 at 5:21 PM, Biplob Biswas 
> wrote:
>
>> Hi,
>>
>> Can anyone care to please look into this issue?  I would really love some
>> assistance here.
>>
>> Thanks a lot.
>>
>> Thanks & Regards
>> Biplob Biswas
>>
>> On Tue, Jul 5, 2016 at 1:00 PM, Biplob Biswas 
>> wrote:
>>
>>>
>>> Hi,
>>>
>>> I implemented the streamingKmeans example provided in the spark website
>>> but
>>> in Java.
>>> The full implementation is here,
>>>
>>> http://pastebin.com/CJQfWNvk
>>>
>>> But i am not getting anything in the output except occasional timestamps
>>> like one below:
>>>
>>> ---
>>> Time: 1466176935000 ms
>>> ---
>>>
>>> Also, i have 2 directories:
>>> "D:\spark\streaming example\Data Sets\training"
>>> "D:\spark\streaming example\Data Sets\test"
>>>
>>> and inside these directories i have 1 file each "samplegpsdata_train.txt"
>>> and "samplegpsdata_test.txt" with training data having 500 datapoints and
>>> test data with 60 datapoints.
>>>
>>> I am very new to the spark systems and any help is highly appreciated.
>>>
>>>
>>> //---//
>>>
>>> Now, I also have now tried using the scala implementation available here:
>>>
>>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingKMeansExample.scala
>>>
>>>
>>> and even had the training and test file provided in the format specified
>>> in
>>> that file as follows:
>>>
>>>  * The rows of the training text files must be vector data in the form
>>>  * `[x1,x2,x3,...,xn]`
>>>  * Where n is the number of dimensions.
>>>  *
>>>  * The rows of the test text files must be labeled data in the form
>>>  * `(y,[x1,x2,x3,...,xn])`
>>>  * Where y is some identifier. n must be the same for train and test.
>>>
>>>
>>> But I still get no output on my eclipse window ... just the Time!
>>>
>>> Can anyone seriously help me with this?
>>>
>>> Thank you so much
>>> Biplob Biswas
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/StreamingKmeans-Spark-doesn-t-work-at-all-tp27286.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>


Re: KEYS file?

2016-07-10 Thread Shuai Lin
>
> at least links to the keys used to sign releases on the
> download page


+1 for that.

On Mon, Jul 11, 2016 at 3:35 AM, Phil Steitz  wrote:

> On 7/10/16 10:57 AM, Shuai Lin wrote:
> > Not sure where you see " 0x7C6C105FFC8ED089". I
>
> That's the key ID for the key below.
> > think the release is signed with the
> > key https://people.apache.org/keys/committer/pwendell.asc .
>
> Thanks!  That key matches.  The project should publish a KEYS file
> [1] or at least links to the keys used to sign releases on the
> download page.  Could be there is one somewhere and I just can't
> find it.
>
> Phil
>
> [1] http://www.apache.org/dev/release-signing.html#keys-policy
> >
> > I think this tutorial can be
> > helpful: http://www.apache.org/info/verification.html
> >
> > On Mon, Jul 11, 2016 at 12:57 AM, Phil Steitz
> > > wrote:
> >
> > I can't seem to find a link the the Spark KEYS file.  I am
> > trying to
> > validate the sigs on the 1.6.2 release artifacts and I need to
> > import 0x7C6C105FFC8ED089.  Is there a KEYS file available for
> > download somewhere?  Apologies if I am just missing an obvious
> > link.
> >
> > Phil
> >
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> > 
> >
> >
>
>
>


Spark logging

2016-07-10 Thread SamyaMaiti
Hi Team,

I have a Spark application up and running on a 10-node standalone cluster.

When I launch the application in cluster mode, I am able to create a separate
log file for the driver and the executors (common to all executors).

But my requirement is to create a separate log file for each executor. Is this
feasible?

I am using org.apache.log4j.Logger.

Regards,
Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-logging-tp27319.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



mllib based on dataset or dataframe

2016-07-10 Thread jinhong lu
Hi,
Since Dataset will be the major API in Spark 2.0, why will MLlib be
DataFrame-based, with 'future development will focus on the DataFrame-based API'?

   Is there any plan to change MLlib from DataFrame-based to Dataset-based?


=
Thanks,
lujinhong


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark crashes with two parquet files

2016-07-10 Thread Takeshi Yamamuro
The log explicitly says "java.lang.OutOfMemoryError: Java heap space", so
you probably need to allocate more JVM memory for Spark.
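
For example, assuming the job is launched through spark-submit or the PySpark
shell, something like this (the script name and sizes are just placeholders to
tune):

spark-submit --driver-memory 4g --executor-memory 4g your_app.py
# or, for the shell:
pyspark --driver-memory 4g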

// maropu

On Mon, Jul 11, 2016 at 11:59 AM, Javier Rey  wrote:

> Also, the problem appears when I use the unionAll clause.
>
> 2016-07-10 21:58 GMT-05:00 Javier Rey :
>
>> This is a part of trace log.
>>
>>  WARN TaskSetManager: Lost task 4.0 in stage 2.0 (TID 13, localhost):
>> java.lang.OutOfMemoryError: Java heap space
>> at
>> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:755)
>> at
>> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:494)
>> at
>> org.apache.spark.sql.execution.datasources.parquet.UnsafeRowParquetRecordReader.checkEndOfRowGroup(UnsafeRowParquetRecord
>>
>> 2016-07-10 21:47 GMT-05:00 Takeshi Yamamuro :
>>
>>> Hi,
>>>
>>> What's the schema in the parquets?
>>> Also, could you show us the stack trace when the error happens?
>>>
>>> // maropu
>>>
>>> On Mon, Jul 11, 2016 at 11:42 AM, Javier Rey  wrote:
>>>
 Hi everybody,

 I installed Spark 1.6.1, I have two parquet files, but when I need show
 registers using unionAll, Spark crash I don't understand what happens.

 But when I use show() only one parquet file this is work correctly.

 code with fault:

 path = '/data/train_parquet/'
 train_df = sqlContext.read.parquet(path)
 train_df.take(1)

 code works:

 path = '/data/train_parquet/0_0_0.parquet'
 train0_df = sqlContext.read.load(path)
 train_df.take(1)

 Thanks in advance.

 Samir

>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>


-- 
---
Takeshi Yamamuro


Re: How to run Zeppelin and Spark Thrift Server Together

2016-07-10 Thread Takeshi Yamamuro
Hi,

ISTM multiple sparkcontexts are not recommended in spark.
See: https://issues.apache.org/jira/browse/SPARK-2243

// maropu


On Mon, Jul 11, 2016 at 12:01 PM, ayan guha  wrote:

> Hi
>
> Can you try using JDBC interpreter with STS? We are using Zeppelin+STS on
> YARN for few months now without much issue.
>
> On Mon, Jul 11, 2016 at 12:48 PM, Chanh Le  wrote:
>
>> Hi everybody,
>> We are using Spark to query big data and currently we’re using Zeppelin
>> to provide a UI for technical users.
>> Now we also need to provide a UI for business users so we use Oracle BI
>> tools and set up a Spark Thrift Server (STS) for it.
>>
>> When I run both Zeppelin and STS throw error:
>>
>> INFO [2016-07-11 09:40:21,905] ({pool-2-thread-4}
>> SchedulerFactory.java[jobStarted]:131) - Job
>> remoteInterpretJob_1468204821905 started by scheduler
>> org.apache.zeppelin.spark.SparkInterpreter835015739
>>  INFO [2016-07-11 09:40:21,911] ({pool-2-thread-4}
>> Logging.scala[logInfo]:58) - Changing view acls to: giaosudau
>>  INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4}
>> Logging.scala[logInfo]:58) - Changing modify acls to: giaosudau
>>  INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4}
>> Logging.scala[logInfo]:58) - SecurityManager: authentication disabled; ui
>> acls disabled; users with view permissions: Set(giaosudau); users with
>> modify permissions: Set(giaosudau)
>>  INFO [2016-07-11 09:40:21,918] ({pool-2-thread-4}
>> Logging.scala[logInfo]:58) - Starting HTTP Server
>>  INFO [2016-07-11 09:40:21,919] ({pool-2-thread-4}
>> Server.java[doStart]:272) - jetty-8.y.z-SNAPSHOT
>>  INFO [2016-07-11 09:40:21,920] ({pool-2-thread-4}
>> AbstractConnector.java[doStart]:338) - Started
>> SocketConnector@0.0.0.0:54818
>>  INFO [2016-07-11 09:40:21,922] ({pool-2-thread-4}
>> Logging.scala[logInfo]:58) - Successfully started service 'HTTP class
>> server' on port 54818.
>>  INFO [2016-07-11 09:40:22,408] ({pool-2-thread-4}
>> SparkInterpreter.java[createSparkContext]:233) - -- Create new
>> SparkContext local[*] ---
>>  WARN [2016-07-11 09:40:22,411] ({pool-2-thread-4}
>> Logging.scala[logWarning]:70) - Another SparkContext is being constructed
>> (or threw an exception in its constructor).  This may indicate an error,
>> since only one SparkContext may be running in this JVM (see SPARK-2243).
>> The other SparkContext was created at:
>>
>> Is that mean I need to setup allow multiple context? Because It’s only
>> test in local with local mode If I deploy on mesos cluster what would
>> happened?
>>
>> Need you guys suggests some solutions for that. Thanks.
>>
>> Chanh
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>



-- 
---
Takeshi Yamamuro


Re: How to run Zeppelin and Spark Thrift Server Together

2016-07-10 Thread Chanh Le
Hi Ayan,

It is a brilliant idea. Thank you very much. I will try it this way.

Regards,
Chanh


> On Jul 11, 2016, at 10:01 AM, ayan guha  wrote:
> 
> Hi
> 
> Can you try using JDBC interpreter with STS? We are using Zeppelin+STS on 
> YARN for few months now without much issue. 
> 
> On Mon, Jul 11, 2016 at 12:48 PM, Chanh Le  > wrote:
> Hi everybody,
> We are using Spark to query big data and currently we’re using Zeppelin to 
> provide a UI for technical users.
> Now we also need to provide a UI for business users so we use Oracle BI tools 
> and set up a Spark Thrift Server (STS) for it.
> 
> When I run both Zeppelin and STS throw error:
> 
> INFO [2016-07-11 09:40:21,905] ({pool-2-thread-4} 
> SchedulerFactory.java[jobStarted]:131) - Job remoteInterpretJob_1468204821905 
> started by scheduler org.apache.zeppelin.spark.SparkInterpreter835015739
>  INFO [2016-07-11 09:40:21,911] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - Changing view acls to: giaosudau
>  INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - Changing modify acls to: giaosudau
>  INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - SecurityManager: authentication disabled; ui acls disabled; users with view 
> permissions: Set(giaosudau); users with modify permissions: Set(giaosudau)
>  INFO [2016-07-11 09:40:21,918] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - Starting HTTP Server
>  INFO [2016-07-11 09:40:21,919] ({pool-2-thread-4} Server.java[doStart]:272) 
> - jetty-8.y.z-SNAPSHOT
>  INFO [2016-07-11 09:40:21,920] ({pool-2-thread-4} 
> AbstractConnector.java[doStart]:338) - Started SocketConnector@0.0.0.0:54818 
> 
>  INFO [2016-07-11 09:40:21,922] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - Successfully started service 'HTTP class server' on port 54818.
>  INFO [2016-07-11 09:40:22,408] ({pool-2-thread-4} 
> SparkInterpreter.java[createSparkContext]:233) - -- Create new 
> SparkContext local[*] ---
>  WARN [2016-07-11 09:40:22,411] ({pool-2-thread-4} 
> Logging.scala[logWarning]:70) - Another SparkContext is being constructed (or 
> threw an exception in its constructor).  This may indicate an error, since 
> only one SparkContext may be running in this JVM (see SPARK-2243). The other 
> SparkContext was created at:
> 
> Is that mean I need to setup allow multiple context? Because It’s only test 
> in local with local mode If I deploy on mesos cluster what would happened?
> 
> Need you guys suggests some solutions for that. Thanks.
> 
> Chanh
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> 
> 
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha



Re: "client / server" config

2016-07-10 Thread ayan guha
Yes, it is expected to move on. If it looks like it is waiting for something,
my first instinct would be to check network connectivity, e.g. whether your
cluster has access back to your Mac to read the file (it is probably
waiting to time out).

On Mon, Jul 11, 2016 at 12:59 PM, Jean Georges Perrin  wrote:

> Good for the file :)
>
> No it goes on... Like if it was waiting for something
>
> jg
>
>
> On Jul 10, 2016, at 22:55, ayan guha  wrote:
>
> Is this terminating the execution or spark application still runs after
> this error?
>
> One thing for sure, it is looking for local file on driver (ie your mac) @
> location: file:/Users/jgp/Documents/Data/restaurants-data.json
>
> On Mon, Jul 11, 2016 at 12:33 PM, Jean Georges Perrin  wrote:
>
>>
>> I have my dev environment on my Mac. I have a dev Spark server on a
>> freshly installed physical Ubuntu box.
>>
>> I had some connection issues, but it is now all fine.
>>
>> In my code, running on the Mac, I have:
>>
>> 1 SparkConf conf = new SparkConf().setAppName("myapp").setMaster("
>> spark://10.0.100.120:7077");
>> 2 JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
>> 3 javaSparkContext.setLogLevel("WARN");
>> 4 SQLContext sqlContext = new SQLContext(javaSparkContext);
>> 5
>> 6 // Restaurant Data
>> 7 df = sqlContext.read().option("dateFormat", "-mm-dd").json(source
>> .getLocalStorage());
>>
>>
>> 1) Clarification question: This code runs on my mac, connects to the
>> server, but line #7 assumes the file is on my mac, not on the server, right?
>>
>> 2) On line 7, I get an exception:
>>
>> 16-07-10 22:20:04:143 DEBUG  - address: jgp-MacBook-Air.local/
>> 10.0.100.100 isLoopbackAddress: false, with host 10.0.100.100
>> jgp-MacBook-Air.local
>> 16-07-10 22:20:04:240 INFO
>> org.apache.spark.sql.execution.datasources.json.JSONRelation - Listing
>> file:/Users/jgp/Documents/Data/restaurants-data.json on driver
>> 16-07-10 22:20:04:288 DEBUG org.apache.hadoop.util.Shell - Failed to
>> detect a valid hadoop home directory
>> java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
>> at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:225)
>> at org.apache.hadoop.util.Shell.(Shell.java:250)
>> at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
>> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(
>> FileInputFormat.java:447)
>> at org.apache.spark.sql.execution.datasources.json.JSONRelation.org
>> $apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd(JSONRelation.scala:98)
>> at
>> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4$$anonfun$apply$1.apply(JSONRelation.scala:115)
>> at
>> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4$$anonfun$apply$1.apply(JSONRelation.scala:115)
>> at scala.Option.getOrElse(Option.scala:120)
>> at
>> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:115)
>> at
>> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:109)
>> at scala.Option.getOrElse(Option.scala:120)
>>
>> Do I have to install HADOOP on the server? - I imagine that from:
>> java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
>>
>> TIA,
>>
>> jg
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
>


-- 
Best Regards,
Ayan Guha


Re: How to run Zeppelin and Spark Thrift Server Together

2016-07-10 Thread ayan guha
Hi

Can you try using the JDBC interpreter with STS? We have been using Zeppelin+STS on
YARN for a few months now without much issue.
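
A rough sketch of the JDBC interpreter settings for that setup (property names
from Zeppelin's generic JDBC interpreter; host, port and user are placeholders
for your own STS instance):

default.driver    org.apache.hive.jdbc.HiveDriver
default.url       jdbc:hive2://<sts-host>:<sts-port>
default.user      <user>

Paragraphs are then run with the %jdbc prefix instead of %spark.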

On Mon, Jul 11, 2016 at 12:48 PM, Chanh Le  wrote:

> Hi everybody,
> We are using Spark to query big data and currently we’re using Zeppelin to
> provide a UI for technical users.
> Now we also need to provide a UI for business users so we use Oracle BI
> tools and set up a Spark Thrift Server (STS) for it.
>
> When I run both Zeppelin and STS throw error:
>
> INFO [2016-07-11 09:40:21,905] ({pool-2-thread-4}
> SchedulerFactory.java[jobStarted]:131) - Job
> remoteInterpretJob_1468204821905 started by scheduler
> org.apache.zeppelin.spark.SparkInterpreter835015739
>  INFO [2016-07-11 09:40:21,911] ({pool-2-thread-4}
> Logging.scala[logInfo]:58) - Changing view acls to: giaosudau
>  INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4}
> Logging.scala[logInfo]:58) - Changing modify acls to: giaosudau
>  INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4}
> Logging.scala[logInfo]:58) - SecurityManager: authentication disabled; ui
> acls disabled; users with view permissions: Set(giaosudau); users with
> modify permissions: Set(giaosudau)
>  INFO [2016-07-11 09:40:21,918] ({pool-2-thread-4}
> Logging.scala[logInfo]:58) - Starting HTTP Server
>  INFO [2016-07-11 09:40:21,919] ({pool-2-thread-4}
> Server.java[doStart]:272) - jetty-8.y.z-SNAPSHOT
>  INFO [2016-07-11 09:40:21,920] ({pool-2-thread-4}
> AbstractConnector.java[doStart]:338) - Started
> SocketConnector@0.0.0.0:54818
>  INFO [2016-07-11 09:40:21,922] ({pool-2-thread-4}
> Logging.scala[logInfo]:58) - Successfully started service 'HTTP class
> server' on port 54818.
>  INFO [2016-07-11 09:40:22,408] ({pool-2-thread-4}
> SparkInterpreter.java[createSparkContext]:233) - -- Create new
> SparkContext local[*] ---
>  WARN [2016-07-11 09:40:22,411] ({pool-2-thread-4}
> Logging.scala[logWarning]:70) - Another SparkContext is being constructed
> (or threw an exception in its constructor).  This may indicate an error,
> since only one SparkContext may be running in this JVM (see SPARK-2243).
> The other SparkContext was created at:
>
> Is that mean I need to setup allow multiple context? Because It’s only
> test in local with local mode If I deploy on mesos cluster what would
> happened?
>
> Need you guys suggests some solutions for that. Thanks.
>
> Chanh
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Best Regards,
Ayan Guha


Re: "client / server" config

2016-07-10 Thread Jean Georges Perrin
Good for the file :)

No, it goes on... as if it was waiting for something

jg


> On Jul 10, 2016, at 22:55, ayan guha  wrote:
> 
> Is this terminating the execution or spark application still runs after this 
> error?
> 
> One thing for sure, it is looking for local file on driver (ie your mac) @ 
> location: file:/Users/jgp/Documents/Data/restaurants-data.json 
> 
>> On Mon, Jul 11, 2016 at 12:33 PM, Jean Georges Perrin  wrote:
>> 
>> I have my dev environment on my Mac. I have a dev Spark server on a freshly 
>> installed physical Ubuntu box.
>> 
>> I had some connection issues, but it is now all fine.
>> 
>> In my code, running on the Mac, I have:
>> 
>>  1   SparkConf conf = new 
>> SparkConf().setAppName("myapp").setMaster("spark://10.0.100.120:7077");
>>  2   JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
>>  3   javaSparkContext.setLogLevel("WARN");
>>  4   SQLContext sqlContext = new SQLContext(javaSparkContext);
>>  5
>>  6   // Restaurant Data
>>  7   df = sqlContext.read().option("dateFormat", 
>> "-mm-dd").json(source.getLocalStorage());
>> 
>> 
>> 1) Clarification question: This code runs on my mac, connects to the server, 
>> but line #7 assumes the file is on my mac, not on the server, right?
>> 
>> 2) On line 7, I get an exception:
>> 
>> 16-07-10 22:20:04:143 DEBUG  - address: jgp-MacBook-Air.local/10.0.100.100 
>> isLoopbackAddress: false, with host 10.0.100.100 jgp-MacBook-Air.local
>> 16-07-10 22:20:04:240 INFO 
>> org.apache.spark.sql.execution.datasources.json.JSONRelation - Listing 
>> file:/Users/jgp/Documents/Data/restaurants-data.json on driver
>> 16-07-10 22:20:04:288 DEBUG org.apache.hadoop.util.Shell - Failed to detect 
>> a valid hadoop home directory
>> java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
>>  at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:225)
>>  at org.apache.hadoop.util.Shell.(Shell.java:250)
>>  at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
>>  at 
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:447)
>>  at 
>> org.apache.spark.sql.execution.datasources.json.JSONRelation.org$apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd(JSONRelation.scala:98)
>>  at 
>> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4$$anonfun$apply$1.apply(JSONRelation.scala:115)
>>  at 
>> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4$$anonfun$apply$1.apply(JSONRelation.scala:115)
>>  at scala.Option.getOrElse(Option.scala:120)
>>  at 
>> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:115)
>>  at 
>> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:109)
>>  at scala.Option.getOrElse(Option.scala:120)
>> 
>> Do I have to install HADOOP on the server? - I imagine that from:
>> java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
>> 
>> TIA,
>> 
>> jg
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha


Re: "client / server" config

2016-07-10 Thread ayan guha
Is this terminating the execution, or does the Spark application still run after
this error?

One thing is for sure: it is looking for a local file on the driver (i.e. your Mac) at
location: file:/Users/jgp/Documents/Data/restaurants-data.json
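
If you want the read to happen on the cluster instead, the file has to be
reachable from the executors, e.g. on HDFS or at an identical path on every
node; roughly, keeping your Java snippet (the HDFS URL is a placeholder):

df = sqlContext.read().json("hdfs://<namenode>:8020/data/restaurants-data.json");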

On Mon, Jul 11, 2016 at 12:33 PM, Jean Georges Perrin  wrote:

>
> I have my dev environment on my Mac. I have a dev Spark server on a
> freshly installed physical Ubuntu box.
>
> I had some connection issues, but it is now all fine.
>
> In my code, running on the Mac, I have:
>
> 1 SparkConf conf = new SparkConf().setAppName("myapp").setMaster("
> spark://10.0.100.120:7077");
> 2 JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
> 3 javaSparkContext.setLogLevel("WARN");
> 4 SQLContext sqlContext = new SQLContext(javaSparkContext);
> 5
> 6 // Restaurant Data
> 7 df = sqlContext.read().option("dateFormat", "-mm-dd").json(source
> .getLocalStorage());
>
>
> 1) Clarification question: This code runs on my mac, connects to the
> server, but line #7 assumes the file is on my mac, not on the server, right?
>
> 2) On line 7, I get an exception:
>
> 16-07-10 22:20:04:143 DEBUG  - address: jgp-MacBook-Air.local/10.0.100.100
> isLoopbackAddress: false, with host 10.0.100.100 jgp-MacBook-Air.local
> 16-07-10 22:20:04:240 INFO
> org.apache.spark.sql.execution.datasources.json.JSONRelation - Listing
> file:/Users/jgp/Documents/Data/restaurants-data.json on driver
> 16-07-10 22:20:04:288 DEBUG org.apache.hadoop.util.Shell - Failed to
> detect a valid hadoop home directory
> java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
> at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:225)
> at org.apache.hadoop.util.Shell.(Shell.java:250)
> at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(
> FileInputFormat.java:447)
> at org.apache.spark.sql.execution.datasources.json.JSONRelation.org
> $apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd(JSONRelation.scala:98)
> at
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4$$anonfun$apply$1.apply(JSONRelation.scala:115)
> at
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4$$anonfun$apply$1.apply(JSONRelation.scala:115)
> at scala.Option.getOrElse(Option.scala:120)
> at
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:115)
> at
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:109)
> at scala.Option.getOrElse(Option.scala:120)
>
> Do I have to install HADOOP on the server? - I imagine that from:
> java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
>
> TIA,
>
> jg
>
>


-- 
Best Regards,
Ayan Guha


How to run Zeppelin and Spark Thrift Server Together

2016-07-10 Thread Chanh Le
Hi everybody,
We are using Spark to query big data and currently we’re using Zeppelin to 
provide a UI for technical users.
Now we also need to provide a UI for business users so we use Oracle BI tools 
and set up a Spark Thrift Server (STS) for it.

When I run both Zeppelin and STS, it throws an error:

INFO [2016-07-11 09:40:21,905] ({pool-2-thread-4} 
SchedulerFactory.java[jobStarted]:131) - Job remoteInterpretJob_1468204821905 
started by scheduler org.apache.zeppelin.spark.SparkInterpreter835015739
 INFO [2016-07-11 09:40:21,911] ({pool-2-thread-4} Logging.scala[logInfo]:58) - 
Changing view acls to: giaosudau
 INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4} Logging.scala[logInfo]:58) - 
Changing modify acls to: giaosudau
 INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4} Logging.scala[logInfo]:58) - 
SecurityManager: authentication disabled; ui acls disabled; users with view 
permissions: Set(giaosudau); users with modify permissions: Set(giaosudau)
 INFO [2016-07-11 09:40:21,918] ({pool-2-thread-4} Logging.scala[logInfo]:58) - 
Starting HTTP Server
 INFO [2016-07-11 09:40:21,919] ({pool-2-thread-4} Server.java[doStart]:272) - 
jetty-8.y.z-SNAPSHOT
 INFO [2016-07-11 09:40:21,920] ({pool-2-thread-4} 
AbstractConnector.java[doStart]:338) - Started SocketConnector@0.0.0.0:54818
 INFO [2016-07-11 09:40:21,922] ({pool-2-thread-4} Logging.scala[logInfo]:58) - 
Successfully started service 'HTTP class server' on port 54818.
 INFO [2016-07-11 09:40:22,408] ({pool-2-thread-4} 
SparkInterpreter.java[createSparkContext]:233) - -- Create new SparkContext 
local[*] ---
 WARN [2016-07-11 09:40:22,411] ({pool-2-thread-4} 
Logging.scala[logWarning]:70) - Another SparkContext is being constructed (or 
threw an exception in its constructor).  This may indicate an error, since only 
one SparkContext may be running in this JVM (see SPARK-2243). The other 
SparkContext was created at:

Does that mean I need to set it up to allow multiple contexts? Because this is only a test
locally in local mode; if I deploy on a Mesos cluster, what would happen?

Need you guys suggests some solutions for that. Thanks.

Chanh
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark crashes with two parquet files

2016-07-10 Thread Takeshi Yamamuro
Hi,

What's the schema in the parquets?
Also, could you show us the stack trace when the error happens?

// maropu

On Mon, Jul 11, 2016 at 11:42 AM, Javier Rey  wrote:

> Hi everybody,
>
> I installed Spark 1.6.1, I have two parquet files, but when I need show
> registers using unionAll, Spark crash I don't understand what happens.
>
> But when I use show() only one parquet file this is work correctly.
>
> code with fault:
>
> path = '/data/train_parquet/'
> train_df = sqlContext.read.parquet(path)
> train_df.take(1)
>
> code works:
>
> path = '/data/train_parquet/0_0_0.parquet'
> train0_df = sqlContext.read.load(path)
> train_df.take(1)
>
> Thanks in advance.
>
> Samir
>



-- 
---
Takeshi Yamamuro


Spark crashes with two parquet files

2016-07-10 Thread Javier Rey
Hi everybody,

I installed Spark 1.6.1. I have two parquet files, but when I need to show
records using unionAll, Spark crashes and I don't understand what happens.

But when I use show() on only one parquet file, it works correctly.

code with fault:

path = '/data/train_parquet/'
train_df = sqlContext.read.parquet(path)
train_df.take(1)

code works:

path = '/data/train_parquet/0_0_0.parquet'
train0_df = sqlContext.read.load(path)
train_df.take(1)

Thanks in advance.

Samir


"client / server" config

2016-07-10 Thread Jean Georges Perrin

I have my dev environment on my Mac. I have a dev Spark server on a freshly 
installed physical Ubuntu box.

I had some connection issues, but it is now all fine.

In my code, running on the Mac, I have:

1   SparkConf conf = new 
SparkConf().setAppName("myapp").setMaster("spark://10.0.100.120:7077");
2   JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
3   javaSparkContext.setLogLevel("WARN");
4   SQLContext sqlContext = new SQLContext(javaSparkContext);
5
6   // Restaurant Data
7   df = sqlContext.read().option("dateFormat", 
"-mm-dd").json(source.getLocalStorage());


1) Clarification question: This code runs on my mac, connects to the server, 
but line #7 assumes the file is on my mac, not on the server, right?

2) On line 7, I get an exception:

16-07-10 22:20:04:143 DEBUG  - address: jgp-MacBook-Air.local/10.0.100.100 
isLoopbackAddress: false, with host 10.0.100.100 jgp-MacBook-Air.local
16-07-10 22:20:04:240 INFO 
org.apache.spark.sql.execution.datasources.json.JSONRelation - Listing 
file:/Users/jgp/Documents/Data/restaurants-data.json on driver
16-07-10 22:20:04:288 DEBUG org.apache.hadoop.util.Shell - Failed to detect a 
valid hadoop home directory
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:225)
at org.apache.hadoop.util.Shell.(Shell.java:250)
at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:447)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation.org$apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd(JSONRelation.scala:98)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4$$anonfun$apply$1.apply(JSONRelation.scala:115)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4$$anonfun$apply$1.apply(JSONRelation.scala:115)
at scala.Option.getOrElse(Option.scala:120)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:115)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:109)
at scala.Option.getOrElse(Option.scala:120)

Do I have to install HADOOP on the server? - I imagine that from:
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.

TIA,

jg



Re: Network issue on deployment

2016-07-10 Thread Jean Georges Perrin
It appears I had issues in my /etc/hosts... it seems OK now.
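
For the record, this is the kind of change involved (entries reconstructed from
the logs in this thread; the "resolves to a loopback address" warning comes
from Ubuntu's default 127.0.1.1 hostname mapping):

# /etc/hosts before
127.0.1.1    micha
# /etc/hosts after
10.0.100.120 micha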

> On Jul 10, 2016, at 2:13 PM, Jean Georges Perrin  wrote:
> 
> I tested that:
> 
> I set:
> 
> _JAVA_OPTIONS=-Djava.net.preferIPv4Stack=true
> SPARK_LOCAL_IP=10.0.100.120
> I still have the warning in the log:
> 
> 16/07/10 14:10:13 WARN Utils: Your hostname, micha resolves to a loopback 
> address: 127.0.1.1; using 10.0.100.120 instead (on interface eno1)
> 16/07/10 14:10:13 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> and still connection refused...
> 
> but no luck
> 
>> On Jul 10, 2016, at 1:26 PM, Jean Georges Perrin > > wrote:
>> 
>> Hi,
>> 
>> So far I have been using Spark "embedded" in my app. Now, I'd like to run it 
>> on a dedicated server.
>> 
>> I am that far:
>> - fresh ubuntu 16, server name is mocha / ip 10.0.100.120, installed scala 
>> 2.10, installed Spark 1.6.2, recompiled
>> - Pi test works
>> - UI on port 8080 works
>> 
>> Log says:
>> Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp 
>> /opt/apache-spark-1.6.2/conf/:/opt/apache-spark-1.6.2/assembly/target/scala-2.10/spark-assembly-1.6.2-hadoop2.2.0.jar:/opt/apache-spark-1.6.2/lib_managed/jars/datanucleus-rdbms-3.2.9.jar:/opt/apache-spark-1.6.2/lib_managed/jars/datanucleus-core-3.2.10.jar:/opt/apache-spark-1.6.2/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar
>>  -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip micha --port 7077 
>> --webui-port 8080
>> 
>> Using Spark's default log4j profile: 
>> org/apache/spark/log4j-defaults.properties
>> 16/07/10 13:03:55 INFO Master: Registered signal handlers for [TERM, HUP, 
>> INT]
>> 16/07/10 13:03:55 WARN Utils: Your hostname, micha resolves to a loopback 
>> address: 127.0.1.1; using 10.0.100.120 instead (on interface eno1)
>> 16/07/10 13:03:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
>> another address
>> 16/07/10 13:03:55 WARN NativeCodeLoader: Unable to load native-hadoop 
>> library for your platform... using builtin-java classes where applicable
>> 16/07/10 13:03:55 INFO SecurityManager: Changing view acls to: root
>> 16/07/10 13:03:55 INFO SecurityManager: Changing modify acls to: root
>> 16/07/10 13:03:55 INFO SecurityManager: SecurityManager: authentication 
>> disabled; ui acls disabled; users with view permissions: Set(root); users 
>> with modify permissions: Set(root)
>> 16/07/10 13:03:56 INFO Utils: Successfully started service 'sparkMaster' on 
>> port 7077.
>> 16/07/10 13:03:56 INFO Master: Starting Spark master at spark://micha:7077 
>> 
>> 16/07/10 13:03:56 INFO Master: Running Spark version 1.6.2
>> 16/07/10 13:03:56 INFO Server: jetty-8.y.z-SNAPSHOT
>> 16/07/10 13:03:56 INFO AbstractConnector: Started 
>> SelectChannelConnector@0.0.0.0 :8080
>> 16/07/10 13:03:56 INFO Utils: Successfully started service 'MasterUI' on 
>> port 8080.
>> 16/07/10 13:03:56 INFO MasterWebUI: Started MasterWebUI at 
>> http://10.0.100.120:8080 
>> 16/07/10 13:03:56 INFO Server: jetty-8.y.z-SNAPSHOT
>> 16/07/10 13:03:56 INFO AbstractConnector: Started 
>> SelectChannelConnector@micha:6066
>> 16/07/10 13:03:56 INFO Utils: Successfully started service on port 6066.
>> 16/07/10 13:03:56 INFO StandaloneRestServer: Started REST server for 
>> submitting applications on port 6066
>> 16/07/10 13:03:56 INFO Master: I have been elected leader! New state: ALIVE
>> 
>> 
>> In my app, i changed the config to:
>>  SparkConf conf = new 
>> SparkConf().setAppName("myapp").setMaster("spark://10.0.100.120:6066 
>> ");
>> 
>> 
>> (also tried 7077)
>> 
>> 
>> On the client:
>> 16-07-10 13:22:58:300 INFO org.spark-project.jetty.server.AbstractConnector 
>> - Started SelectChannelConnector@0.0.0.0 
>> :4040
>> 16-07-10 13:22:58:300 DEBUG 
>> org.spark-project.jetty.util.component.AbstractLifeCycle - STARTED 
>> SelectChannelConnector@0.0.0.0 :4040
>> 16-07-10 13:22:58:300 DEBUG 
>> org.spark-project.jetty.util.component.AbstractLifeCycle - STARTED 
>> org.spark-project.jetty.server.Server@3eb292cd
>> 16-07-10 13:22:58:301 INFO org.apache.spark.util.Utils - Successfully 
>> started service 'SparkUI' on port 4040.
>> 16-07-10 13:22:58:306 INFO org.apache.spark.ui.SparkUI - Started SparkUI at 
>> http://10.0.100.100:4040 
>> 16-07-10 13:22:58:621 INFO 
>> org.apache.spark.deploy.client.AppClient$ClientEndpoint - Connecting to 
>> master spark://10.0.100.120:6066 ...
>> 16-07-10 13:22:58:648 DEBUG 
>> org.apache.spark.network.client.TransportClientFactory - Creating new 
>> connection to /10.0.100.120:6066
>> 16-07-10 13:22:58:689 DEBUG io.netty.util.ResourceLeakDetector - 
>> -Dio.netty.leakDetectionLevel: simple
>> 16-07-10 13:22:58:714 WARN 
>> org.apache.spark.deploy.client.AppClient$ClientEndpoint - Failed to 

Re: IS NOT NULL is not working in programmatic SQL in spark

2016-07-10 Thread Takeshi Yamamuro
Hi,

One solution is to use `spark-csv` (see:
https://github.com/databricks/spark-csv#features).
To load NULLs, you can use the `nullValue` option there.
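
Alternatively, since the offending value is an empty string rather than a real
NULL (as Sean pointed out), you can filter it directly against the temp table
from your snippet, e.g.:

sqlContext.sql("SELECT MAX(id), code FROM cntids WHERE code IS NOT NULL AND code != '' GROUP BY code").show()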

// maropu


On Mon, Jul 11, 2016 at 1:14 AM, Radha krishna  wrote:

> I want to apply null comparison to a column in sqlcontext.sql, is there
> any way to achieve this?
> On Jul 10, 2016 8:55 PM, "Radha krishna"  wrote:
>
>> Ok thank you, how to achieve the requirement.
>>
>> On Sun, Jul 10, 2016 at 8:44 PM, Sean Owen  wrote:
>>
>>> It doesn't look like you have a NULL field, You have a string-value
>>> field with an empty string.
>>>
>>> On Sun, Jul 10, 2016 at 3:19 PM, Radha krishna 
>>> wrote:
>>> > Hi All,IS NOT NULL is not working in programmatic sql. check below for
>>> input
>>> > output and code.
>>> >
>>> > Input
>>> > 
>>> > 10,IN
>>> > 11,PK
>>> > 12,US
>>> > 13,UK
>>> > 14,US
>>> > 15,IN
>>> > 16,
>>> > 17,AS
>>> > 18,AS
>>> > 19,IR
>>> > 20,As
>>> >
>>> > val cntdat = sc.textFile("/user/poc_hortonworks/radha/gsd/sample.txt");
>>> > case class CNT (id:Int , code : String)
>>> > val cntdf = cntdat.map((f) => { val ff=f.split(",");new
>>> > CNT(ff(0).toInt,ff(1))}).toDF
>>> > cntdf.registerTempTable("cntids");
>>> > sqlContext.sql("SELECT MAX(id),code FROM cntids WHERE code is not null
>>> GROUP
>>> > BY code").show()
>>> >
>>> > Output
>>> > =
>>> > +---++
>>> > |_c0|code|
>>> > +---++
>>> > | 18|  AS|
>>> > | 16|  |
>>> > | 13|  UK|
>>> > | 14|  US|
>>> > | 20|  As|
>>> > | 15|  IN|
>>> > | 19|  IR|
>>> > | 11|  PK|
>>> > +---++
>>> >
>>> > i am expecting the below one any idea, how to apply IS NOT NULL ?
>>> >
>>> > +---++
>>> > |_c0|code|
>>> > +---++
>>> > | 18|  AS|
>>> > | 13|  UK|
>>> > | 14|  US|
>>> > | 20|  As|
>>> > | 15|  IN|
>>> > | 19|  IR|
>>> > | 11|  PK|
>>> > +---++
>>> >
>>> >
>>> >
>>> > Thanks & Regards
>>> >Radha krishna
>>> >
>>> >
>>>
>>
>>
>>
>> --
>>
>>
>>
>>
>>
>>
>>
>>
>> Thanks & Regards
>>Radha krishna
>>
>>
>>


-- 
---
Takeshi Yamamuro


Re: KEYS file?

2016-07-10 Thread Phil Steitz
On 7/10/16 10:57 AM, Shuai Lin wrote:
> Not sure where you see " 0x7C6C105FFC8ED089". I

That's the key ID for the key below.
> think the release is signed with the
> key https://people.apache.org/keys/committer/pwendell.asc .

Thanks!  That key matches.  The project should publish a KEYS file
[1] or at least links to the keys used to sign releases on the
download page.  Could be there is one somewhere and I just can't
find it.

Phil

[1] http://www.apache.org/dev/release-signing.html#keys-policy
>
> I think this tutorial can be
> helpful: http://www.apache.org/info/verification.html
>
> On Mon, Jul 11, 2016 at 12:57 AM, Phil Steitz
> > wrote:
>
> I can't seem to find a link the the Spark KEYS file.  I am
> trying to
> validate the sigs on the 1.6.2 release artifacts and I need to
> import 0x7C6C105FFC8ED089.  Is there a KEYS file available for
> download somewhere?  Apologies if I am just missing an obvious
> link.
>
> Phil
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
>
>



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: StreamingKmeans Spark doesn't work at all

2016-07-10 Thread Biplob Biswas
Hi,

I know I am asking again, but I tried running the same thing on a Mac as well,
since some answers on the internet suggested it could be an issue with the
Windows environment, but still nothing works.

Can anyone at least suggest whether it's a bug with Spark or something
else?

Would be really grateful! Thanks a lot.

Thanks & Regards
Biplob Biswas

On Thu, Jul 7, 2016 at 5:21 PM, Biplob Biswas 
wrote:

> Hi,
>
> Can anyone care to please look into this issue?  I would really love some
> assistance here.
>
> Thanks a lot.
>
> Thanks & Regards
> Biplob Biswas
>
> On Tue, Jul 5, 2016 at 1:00 PM, Biplob Biswas 
> wrote:
>
>>
>> Hi,
>>
>> I implemented the streamingKmeans example provided in the spark website
>> but
>> in Java.
>> The full implementation is here,
>>
>> http://pastebin.com/CJQfWNvk
>>
>> But i am not getting anything in the output except occasional timestamps
>> like one below:
>>
>> ---
>> Time: 1466176935000 ms
>> ---
>>
>> Also, i have 2 directories:
>> "D:\spark\streaming example\Data Sets\training"
>> "D:\spark\streaming example\Data Sets\test"
>>
>> and inside these directories i have 1 file each "samplegpsdata_train.txt"
>> and "samplegpsdata_test.txt" with training data having 500 datapoints and
>> test data with 60 datapoints.
>>
>> I am very new to the spark systems and any help is highly appreciated.
>>
>>
>> //---//
>>
>> Now, I also have now tried using the scala implementation available here:
>>
>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingKMeansExample.scala
>>
>>
>> and even had the training and test file provided in the format specified
>> in
>> that file as follows:
>>
>>  * The rows of the training text files must be vector data in the form
>>  * `[x1,x2,x3,...,xn]`
>>  * Where n is the number of dimensions.
>>  *
>>  * The rows of the test text files must be labeled data in the form
>>  * `(y,[x1,x2,x3,...,xn])`
>>  * Where y is some identifier. n must be the same for train and test.
>>
>>
>> But I still get no output on my eclipse window ... just the Time!
>>
>> Can anyone seriously help me with this?
>>
>> Thank you so much
>> Biplob Biswas
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/StreamingKmeans-Spark-doesn-t-work-at-all-tp27286.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: Network issue on deployment

2016-07-10 Thread Jean Georges Perrin
I tested that:

I set:

_JAVA_OPTIONS=-Djava.net.preferIPv4Stack=true
SPARK_LOCAL_IP=10.0.100.120
I still have the warning in the log:

16/07/10 14:10:13 WARN Utils: Your hostname, micha resolves to a loopback 
address: 127.0.1.1; using 10.0.100.120 instead (on interface eno1)
16/07/10 14:10:13 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
and still connection refused...

but no luck

> On Jul 10, 2016, at 1:26 PM, Jean Georges Perrin  wrote:
> 
> Hi,
> 
> So far I have been using Spark "embedded" in my app. Now, I'd like to run it 
> on a dedicated server.
> 
> I am that far:
> - fresh ubuntu 16, server name is mocha / ip 10.0.100.120, installed scala 
> 2.10, installed Spark 1.6.2, recompiled
> - Pi test works
> - UI on port 8080 works
> 
> Log says:
> Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp 
> /opt/apache-spark-1.6.2/conf/:/opt/apache-spark-1.6.2/assembly/target/scala-2.10/spark-assembly-1.6.2-hadoop2.2.0.jar:/opt/apache-spark-1.6.2/lib_managed/jars/datanucleus-rdbms-3.2.9.jar:/opt/apache-spark-1.6.2/lib_managed/jars/datanucleus-core-3.2.10.jar:/opt/apache-spark-1.6.2/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar
>  -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip micha --port 7077 
> --webui-port 8080
> 
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 16/07/10 13:03:55 INFO Master: Registered signal handlers for [TERM, HUP, INT]
> 16/07/10 13:03:55 WARN Utils: Your hostname, micha resolves to a loopback 
> address: 127.0.1.1; using 10.0.100.120 instead (on interface eno1)
> 16/07/10 13:03:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 16/07/10 13:03:55 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/07/10 13:03:55 INFO SecurityManager: Changing view acls to: root
> 16/07/10 13:03:55 INFO SecurityManager: Changing modify acls to: root
> 16/07/10 13:03:55 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(root); users 
> with modify permissions: Set(root)
> 16/07/10 13:03:56 INFO Utils: Successfully started service 'sparkMaster' on 
> port 7077.
> 16/07/10 13:03:56 INFO Master: Starting Spark master at spark://micha:7077 
> 
> 16/07/10 13:03:56 INFO Master: Running Spark version 1.6.2
> 16/07/10 13:03:56 INFO Server: jetty-8.y.z-SNAPSHOT
> 16/07/10 13:03:56 INFO AbstractConnector: Started 
> SelectChannelConnector@0.0.0.0 :8080
> 16/07/10 13:03:56 INFO Utils: Successfully started service 'MasterUI' on port 
> 8080.
> 16/07/10 13:03:56 INFO MasterWebUI: Started MasterWebUI at 
> http://10.0.100.120:8080 
> 16/07/10 13:03:56 INFO Server: jetty-8.y.z-SNAPSHOT
> 16/07/10 13:03:56 INFO AbstractConnector: Started 
> SelectChannelConnector@micha:6066
> 16/07/10 13:03:56 INFO Utils: Successfully started service on port 6066.
> 16/07/10 13:03:56 INFO StandaloneRestServer: Started REST server for 
> submitting applications on port 6066
> 16/07/10 13:03:56 INFO Master: I have been elected leader! New state: ALIVE
> 
> 
> In my app, i changed the config to:
>   SparkConf conf = new 
> SparkConf().setAppName("myapp").setMaster("spark://10.0.100.120:6066 
> ");
> 
> 
> (also tried 7077)
> 
> 
> On the client:
> 16-07-10 13:22:58:300 INFO org.spark-project.jetty.server.AbstractConnector - 
> Started SelectChannelConnector@0.0.0.0 
> :4040
> 16-07-10 13:22:58:300 DEBUG 
> org.spark-project.jetty.util.component.AbstractLifeCycle - STARTED 
> SelectChannelConnector@0.0.0.0 :4040
> 16-07-10 13:22:58:300 DEBUG 
> org.spark-project.jetty.util.component.AbstractLifeCycle - STARTED 
> org.spark-project.jetty.server.Server@3eb292cd
> 16-07-10 13:22:58:301 INFO org.apache.spark.util.Utils - Successfully started 
> service 'SparkUI' on port 4040.
> 16-07-10 13:22:58:306 INFO org.apache.spark.ui.SparkUI - Started SparkUI at 
> http://10.0.100.100:4040 
> 16-07-10 13:22:58:621 INFO 
> org.apache.spark.deploy.client.AppClient$ClientEndpoint - Connecting to 
> master spark://10.0.100.120:6066 ...
> 16-07-10 13:22:58:648 DEBUG 
> org.apache.spark.network.client.TransportClientFactory - Creating new 
> connection to /10.0.100.120:6066
> 16-07-10 13:22:58:689 DEBUG io.netty.util.ResourceLeakDetector - 
> -Dio.netty.leakDetectionLevel: simple
> 16-07-10 13:22:58:714 WARN 
> org.apache.spark.deploy.client.AppClient$ClientEndpoint - Failed to connect 
> to master 10.0.100.120:6066
> java.io.IOException: Failed to connect to /10.0.100.120:6066
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
> 
> and if I try to telnet:
> 
> $ telnet 10.0.100.120 6066
> Trying 

Re: KEYS file?

2016-07-10 Thread Shuai Lin
Not sure where you see " 0x7C6C105FFC8ED089". I think the release is signed
with the key https://people.apache.org/keys/committer/pwendell.asc .

I think this tutorial can be helpful:
http://www.apache.org/info/verification.html
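
Roughly (assuming you have downloaded the .asc key file above, plus the release
artifact and its matching .asc signature):

gpg --import pwendell.asc
gpg --verify spark-1.6.2.tgz.asc spark-1.6.2.tgz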

On Mon, Jul 11, 2016 at 12:57 AM, Phil Steitz  wrote:

> I can't seem to find a link the the Spark KEYS file.  I am trying to
> validate the sigs on the 1.6.2 release artifacts and I need to
> import 0x7C6C105FFC8ED089.  Is there a KEYS file available for
> download somewhere?  Apologies if I am just missing an obvious link.
>
> Phil
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Network issue on deployment

2016-07-10 Thread Jean Georges Perrin
Hi,

So far I have been using Spark "embedded" in my app. Now, I'd like to run it on
a dedicated server.

I have gotten this far:
- fresh Ubuntu 16, server name is micha / IP 10.0.100.120, installed Scala
2.10, installed Spark 1.6.2, recompiled
- Pi test works
- UI on port 8080 works

Log says:
Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp 
/opt/apache-spark-1.6.2/conf/:/opt/apache-spark-1.6.2/assembly/target/scala-2.10/spark-assembly-1.6.2-hadoop2.2.0.jar:/opt/apache-spark-1.6.2/lib_managed/jars/datanucleus-rdbms-3.2.9.jar:/opt/apache-spark-1.6.2/lib_managed/jars/datanucleus-core-3.2.10.jar:/opt/apache-spark-1.6.2/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar
 -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip micha --port 7077 
--webui-port 8080

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/07/10 13:03:55 INFO Master: Registered signal handlers for [TERM, HUP, INT]
16/07/10 13:03:55 WARN Utils: Your hostname, micha resolves to a loopback 
address: 127.0.1.1; using 10.0.100.120 instead (on interface eno1)
16/07/10 13:03:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
16/07/10 13:03:55 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
16/07/10 13:03:55 INFO SecurityManager: Changing view acls to: root
16/07/10 13:03:55 INFO SecurityManager: Changing modify acls to: root
16/07/10 13:03:55 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(root); users with 
modify permissions: Set(root)
16/07/10 13:03:56 INFO Utils: Successfully started service 'sparkMaster' on 
port 7077.
16/07/10 13:03:56 INFO Master: Starting Spark master at spark://micha:7077
16/07/10 13:03:56 INFO Master: Running Spark version 1.6.2
16/07/10 13:03:56 INFO Server: jetty-8.y.z-SNAPSHOT
16/07/10 13:03:56 INFO AbstractConnector: Started 
SelectChannelConnector@0.0.0.0:8080
16/07/10 13:03:56 INFO Utils: Successfully started service 'MasterUI' on port 
8080.
16/07/10 13:03:56 INFO MasterWebUI: Started MasterWebUI at 
http://10.0.100.120:8080
16/07/10 13:03:56 INFO Server: jetty-8.y.z-SNAPSHOT
16/07/10 13:03:56 INFO AbstractConnector: Started 
SelectChannelConnector@micha:6066
16/07/10 13:03:56 INFO Utils: Successfully started service on port 6066.
16/07/10 13:03:56 INFO StandaloneRestServer: Started REST server for submitting 
applications on port 6066
16/07/10 13:03:56 INFO Master: I have been elected leader! New state: ALIVE


In my app, I changed the config to:
SparkConf conf = new 
SparkConf().setAppName("myapp").setMaster("spark://10.0.100.120:6066");


(also tried 7077)
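
(For reference, a minimal sketch of the usual convention, assuming the default standalone ports shown in the master log above: 7077 is the master's RPC port, which is what setMaster() should point at, while 6066 is the REST submission gateway used by spark-submit in cluster mode, not by SparkContext.)

import org.apache.spark.{SparkConf, SparkContext}

// 7077: default standalone master port for SparkContext connections.
// 6066: REST submission endpoint (spark-submit --deploy-mode cluster), not for setMaster().
val conf = new SparkConf()
  .setAppName("myapp")
  .setMaster("spark://10.0.100.120:7077")
val sc = new SparkContext(conf)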


On the client:
16-07-10 13:22:58:300 INFO org.spark-project.jetty.server.AbstractConnector - 
Started SelectChannelConnector@0.0.0.0:4040
16-07-10 13:22:58:300 DEBUG 
org.spark-project.jetty.util.component.AbstractLifeCycle - STARTED 
SelectChannelConnector@0.0.0.0:4040
16-07-10 13:22:58:300 DEBUG 
org.spark-project.jetty.util.component.AbstractLifeCycle - STARTED 
org.spark-project.jetty.server.Server@3eb292cd
16-07-10 13:22:58:301 INFO org.apache.spark.util.Utils - Successfully started 
service 'SparkUI' on port 4040.
16-07-10 13:22:58:306 INFO org.apache.spark.ui.SparkUI - Started SparkUI at 
http://10.0.100.100:4040
16-07-10 13:22:58:621 INFO 
org.apache.spark.deploy.client.AppClient$ClientEndpoint - Connecting to master 
spark://10.0.100.120:6066...
16-07-10 13:22:58:648 DEBUG 
org.apache.spark.network.client.TransportClientFactory - Creating new 
connection to /10.0.100.120:6066
16-07-10 13:22:58:689 DEBUG io.netty.util.ResourceLeakDetector - 
-Dio.netty.leakDetectionLevel: simple
16-07-10 13:22:58:714 WARN 
org.apache.spark.deploy.client.AppClient$ClientEndpoint - Failed to connect to 
master 10.0.100.120:6066
java.io.IOException: Failed to connect to /10.0.100.120:6066
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)

and if I try to telnet:

$ telnet 10.0.100.120 6066
Trying 10.0.100.120...
telnet: connect to address 10.0.100.120: Connection refused
telnet: Unable to connect to remote host

$ telnet 10.0.100.120 7077
Trying 10.0.100.120...
telnet: connect to address 10.0.100.120: Connection refused
telnet: Unable to connect to remote host

On the server, I checked with netstat:
jgp@micha:/opt/apache-spark$ netstat -a | grep 6066
tcp6       0      0 micha.nc.rr.com:6066    [::]:*                  LISTEN
jgp@micha:/opt/apache-spark$ netstat -a | grep 7077
tcp6       0      0 micha.nc.rr.com:7077    [::]:*                  LISTEN

If I interpret this correctly, it looks like it is listening on IPv6 and not IPv4...

Any clue would be very helpful. I do not think I am that far off, but...

Thanks


jg






KEYS file?

2016-07-10 Thread Phil Steitz
I can't seem to find a link to the Spark KEYS file.  I am trying to
validate the sigs on the 1.6.2 release artifacts and I need to
import 0x7C6C105FFC8ED089.  Is there a KEYS file available for
download somewhere?  Apologies if I am just missing an obvious link.

Phil


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: IS NOT NULL is not working in programmatic SQL in spark

2016-07-10 Thread Radha krishna
I want to apply a null comparison to a column in sqlContext.sql. Is there any
way to achieve this?
On Jul 10, 2016 8:55 PM, "Radha krishna"  wrote:

> Ok thank you, how to achieve the requirement.
>
> On Sun, Jul 10, 2016 at 8:44 PM, Sean Owen  wrote:
>
>> It doesn't look like you have a NULL field, You have a string-value
>> field with an empty string.
>>
>> On Sun, Jul 10, 2016 at 3:19 PM, Radha krishna 
>> wrote:
>> > Hi All,IS NOT NULL is not working in programmatic sql. check below for
>> input
>> > output and code.
>> >
>> > Input
>> > 
>> > 10,IN
>> > 11,PK
>> > 12,US
>> > 13,UK
>> > 14,US
>> > 15,IN
>> > 16,
>> > 17,AS
>> > 18,AS
>> > 19,IR
>> > 20,As
>> >
>> > val cntdat = sc.textFile("/user/poc_hortonworks/radha/gsd/sample.txt");
>> > case class CNT (id:Int , code : String)
>> > val cntdf = cntdat.map((f) => { val ff=f.split(",");new
>> > CNT(ff(0).toInt,ff(1))}).toDF
>> > cntdf.registerTempTable("cntids");
>> > sqlContext.sql("SELECT MAX(id),code FROM cntids WHERE code is not null
>> GROUP
>> > BY code").show()
>> >
>> > Output
>> > =
>> > +---++
>> > |_c0|code|
>> > +---++
>> > | 18|  AS|
>> > | 16|  |
>> > | 13|  UK|
>> > | 14|  US|
>> > | 20|  As|
>> > | 15|  IN|
>> > | 19|  IR|
>> > | 11|  PK|
>> > +---++
>> >
>> > i am expecting the below one any idea, how to apply IS NOT NULL ?
>> >
>> > +---++
>> > |_c0|code|
>> > +---++
>> > | 18|  AS|
>> > | 13|  UK|
>> > | 14|  US|
>> > | 20|  As|
>> > | 15|  IN|
>> > | 19|  IR|
>> > | 11|  PK|
>> > +---++
>> >
>> >
>> >
>> > Thanks & Regards
>> >Radha krishna
>> >
>> >
>>
>
>
>
> --
>
>
>
>
>
>
>
>
> Thanks & Regards
>Radha krishna
>
>
>


How to Register Permanent User-Defined-Functions (UDFs) in SparkSQL

2016-07-10 Thread Lokesh Yadav
Hi,
With sqlContext we can register a UDF like this: sqlContext.udf.register("sample_fn", sample_fn _)
But this UDF is limited to that particular sqlContext only. I wish to make the
registration persistent, so that I can access the same UDF in any subsequent SQLContext.
Or is there any other way to register UDFs in Spark SQL so that they remain
persistent?
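
(As an aside, one common workaround, sketched below, is to centralize all registrations in a helper object and call it for every new SQLContext. This is not a true persistent registration, and the helper name and sample function are made up for illustration.)

import org.apache.spark.sql.SQLContext

// Hypothetical helper: call registerAll on every SQLContext you create so
// that each of them exposes the same set of UDFs.
object Udfs {
  def registerAll(sqlContext: SQLContext): Unit = {
    val sample_fn = (s: String) => s.toUpperCase  // stand-in for your own function
    sqlContext.udf.register("sample_fn", sample_fn)
  }
}

// Usage:
//   val sqlContext = new SQLContext(sc)
//   Udfs.registerAll(sqlContext)
//   sqlContext.sql("SELECT sample_fn(some_column) FROM some_table")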

Regards
Lokesh


Re: IS NOT NULL is not working in programmatic SQL in spark

2016-07-10 Thread Radha krishna
Ok, thank you. How can I achieve the requirement?

On Sun, Jul 10, 2016 at 8:44 PM, Sean Owen  wrote:

> It doesn't look like you have a NULL field, You have a string-value
> field with an empty string.
>
> On Sun, Jul 10, 2016 at 3:19 PM, Radha krishna  wrote:
> > Hi All,IS NOT NULL is not working in programmatic sql. check below for
> input
> > output and code.
> >
> > Input
> > 
> > 10,IN
> > 11,PK
> > 12,US
> > 13,UK
> > 14,US
> > 15,IN
> > 16,
> > 17,AS
> > 18,AS
> > 19,IR
> > 20,As
> >
> > val cntdat = sc.textFile("/user/poc_hortonworks/radha/gsd/sample.txt");
> > case class CNT (id:Int , code : String)
> > val cntdf = cntdat.map((f) => { val ff=f.split(",");new
> > CNT(ff(0).toInt,ff(1))}).toDF
> > cntdf.registerTempTable("cntids");
> > sqlContext.sql("SELECT MAX(id),code FROM cntids WHERE code is not null
> GROUP
> > BY code").show()
> >
> > Output
> > =
> > +---++
> > |_c0|code|
> > +---++
> > | 18|  AS|
> > | 16|  |
> > | 13|  UK|
> > | 14|  US|
> > | 20|  As|
> > | 15|  IN|
> > | 19|  IR|
> > | 11|  PK|
> > +---++
> >
> > i am expecting the below one any idea, how to apply IS NOT NULL ?
> >
> > +---++
> > |_c0|code|
> > +---++
> > | 18|  AS|
> > | 13|  UK|
> > | 14|  US|
> > | 20|  As|
> > | 15|  IN|
> > | 19|  IR|
> > | 11|  PK|
> > +---++
> >
> >
> >
> > Thanks & Regards
> >Radha krishna
> >
> >
>



-- 








Thanks & Regards
   Radha krishna


Re: IS NOT NULL is not working in programmatic SQL in spark

2016-07-10 Thread Sean Owen
It doesn't look like you have a NULL field; you have a string-valued
field with an empty string.
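
For what it's worth, a minimal sketch that treats empty or whitespace-only strings like NULLs, reusing the table and column names from the example quoted below (the DataFrame variant in the comments assumes the same cntdf):

// Filter out both real NULLs and empty/blank codes before aggregating.
sqlContext.sql(
  """SELECT MAX(id), code
    |FROM cntids
    |WHERE code IS NOT NULL AND trim(code) != ''
    |GROUP BY code""".stripMargin).show()

// Roughly equivalent DataFrame version:
// import org.apache.spark.sql.functions._
// cntdf.filter(col("code").isNotNull && (trim(col("code")) !== ""))
//   .groupBy("code").agg(max("id")).show()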

On Sun, Jul 10, 2016 at 3:19 PM, Radha krishna  wrote:
> Hi All,IS NOT NULL is not working in programmatic sql. check below for input
> output and code.
>
> Input
> 
> 10,IN
> 11,PK
> 12,US
> 13,UK
> 14,US
> 15,IN
> 16,
> 17,AS
> 18,AS
> 19,IR
> 20,As
>
> val cntdat = sc.textFile("/user/poc_hortonworks/radha/gsd/sample.txt");
> case class CNT (id:Int , code : String)
> val cntdf = cntdat.map((f) => { val ff=f.split(",");new
> CNT(ff(0).toInt,ff(1))}).toDF
> cntdf.registerTempTable("cntids");
> sqlContext.sql("SELECT MAX(id),code FROM cntids WHERE code is not null GROUP
> BY code").show()
>
> Output
> =
> +---++
> |_c0|code|
> +---++
> | 18|  AS|
> | 16|  |
> | 13|  UK|
> | 14|  US|
> | 20|  As|
> | 15|  IN|
> | 19|  IR|
> | 11|  PK|
> +---++
>
> i am expecting the below one any idea, how to apply IS NOT NULL ?
>
> +---++
> |_c0|code|
> +---++
> | 18|  AS|
> | 13|  UK|
> | 14|  US|
> | 20|  As|
> | 15|  IN|
> | 19|  IR|
> | 11|  PK|
> +---++
>
>
>
> Thanks & Regards
>Radha krishna
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



IS NOT NULL is not working in programmatic SQL in spark

2016-07-10 Thread radha
Hi All, IS NOT NULL is not working in programmatic SQL. Check below for input,
output, and code.

Input 

10,IN
11,PK
12,US
13,UK
14,US
15,IN
16,
17,AS
18,AS
19,IR
20,As

val cntdat = sc.textFile("/user/poc_hortonworks/radha/gsd/sample.txt");
case class CNT (id:Int , code : String)
val cntdf = cntdat.map((f) => { val ff = f.split(","); new CNT(ff(0).toInt, ff(1)) }).toDF
cntdf.registerTempTable("cntids");
sqlContext.sql("SELECT MAX(id), code FROM cntids WHERE code is not null GROUP BY code").show()

Output
=
+---++
|_c0|code|
+---++
| 18|  AS|
| 16|  |
| 13|  UK|
| 14|  US|
| 20|  As|
| 15|  IN|
| 19|  IR|
| 11|  PK|
+---++

I am expecting the output below. Any idea how to apply IS NOT NULL?

+---++
|_c0|code|
+---++
| 18|  AS|
| 13|  UK|
| 14|  US|
| 20|  As|
| 15|  IN|
| 19|  IR|
| 11|  PK|
+---++



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/IS-NOT-NULL-is-not-working-in-programmatic-SQL-in-spark-tp27317.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



IS NOT NULL is not working in programmatic SQL in spark

2016-07-10 Thread Radha krishna
Hi All, IS NOT NULL is not working in programmatic SQL. Check below for
input, output, and code.

Input

10,IN
11,PK
12,US
13,UK
14,US
15,IN
16,
17,AS
18,AS
19,IR
20,As

val cntdat = sc.textFile("/user/poc_hortonworks/radha/gsd/sample.txt");
case class CNT (id:Int , code : String)
val cntdf = cntdat.map((f) => { val ff = f.split(","); new CNT(ff(0).toInt, ff(1)) }).toDF
cntdf.registerTempTable("cntids");
sqlContext.sql("SELECT MAX(id), code FROM cntids WHERE code is not null GROUP BY code").show()

Output
=
+---++
|_c0|code|
+---++
| 18|  AS|
| 16|  |
| 13|  UK|
| 14|  US|
| 20|  As|
| 15|  IN|
| 19|  IR|
| 11|  PK|
+---++

I am expecting the output below. Any idea how to apply IS NOT NULL?

+---++
|_c0|code|
+---++
| 18|  AS|
| 13|  UK|
| 14|  US|
| 20|  As|
| 15|  IN|
| 19|  IR|
| 11|  PK|
+---++



Thanks & Regards
   Radha krishna


location of a partition in the cluster/ how parallelize method distribute the RDD partitions over the cluster.

2016-07-10 Thread Mazen
Hi, 

Any hint about getting the location of a particular RDD partition on the
cluster? Or a workaround?


The parallelize method splits an RDD into partitions, either as specified or
as per the default parallelism configuration. Does parallelize actually
distribute the partitions across the cluster, or are the partitions kept on
the driver node? In the first case, is there a protocol for assigning/mapping
partitions (ParallelCollectionPartition) to workers, or is it just random?
Otherwise, when are the partitions distributed on the cluster? Is that when
tasks are launched on them?
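
(A hedged illustration rather than an authoritative answer: with sc.parallelize the data starts out on the driver and each split is shipped with its task, so placement is only decided when tasks are scheduled. One way to observe where partitions end up being computed is to tag each partition with the executor's hostname; there is also a @DeveloperApi SparkContext.getPreferredLocs(rdd, partitionIndex), which for a plain parallelized collection is usually empty.)

import java.net.InetAddress

// Each task runs on whichever executor the scheduler picks; recording the
// hostname inside the task shows where the partition was actually computed.
val rdd = sc.parallelize(1 to 1000, 8)
val placement = rdd.mapPartitionsWithIndex { (idx, it) =>
  Iterator((idx, InetAddress.getLocalHost.getHostName, it.size))
}.collect()
placement.foreach { case (idx, host, n) =>
  println(s"partition $idx computed on $host ($n elements)")
}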

thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/location-of-a-partition-in-the-cluster-how-parallelize-method-distribute-the-RDD-partitions-over-the-tp27316.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to spin up Kafka using docker and use for Spark Streaming Integration tests

2016-07-10 Thread Lars Albertsson
Let us assume that you want to build an integration test setup where
you run all participating components in Docker.

You create a docker-compose.yml with four services, something like this:

# Start docker-compose.yml
version: '2'

services:
  myapp:
    build: myapp_dir
    links:
      - kafka
      - cassandra

  kafka:
    image: spotify/kafka
    environment:
      - ADVERTISED_HOST
    ports:
      - "2181:2181"
      - "9092:9092"

  cassandra:
    image: spotify/cassandra
    environment:
      - 
    ports:
      - "9042:9042"

  test_harness:
    build: test_harness_dir
    links:
      - kafka
      - cassandra
# End docker-compose.yml

I haven't used the spotify/cassandra image, so you might need to do
some environment variable plumbing to get it working.

Your test harness would then push messages to Kafka, and poll
Cassandra for the expected output. Your Spark Streaming application's
Docker image has Spark installed, and runs Spark with a local master.
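
As a concrete sketch of such a harness (the topic, keyspace, and table names are made up, and it assumes the Kafka and Cassandra ports from the compose file above are reachable, e.g. on localhost or via the ADVERTISED_HOST address):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import com.datastax.driver.core.Cluster

object StreamingTestHarness {
  def main(args: Array[String]): Unit = {
    // Push one input message onto the topic the job under test consumes.
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord[String, String]("input_topic", "key1", """{"id": 1}"""))
    producer.close()

    // Poll Cassandra until the streaming job has written the expected row, or give up.
    val cluster = Cluster.builder().addContactPoint("localhost").withPort(9042).build()
    val session = cluster.connect()
    val deadline = System.currentTimeMillis() + 60000
    var found = false
    while (!found && System.currentTimeMillis() < deadline) {
      found = !session.execute("SELECT * FROM test_keyspace.results WHERE id = 1").isExhausted
      if (!found) Thread.sleep(1000)
    }
    session.close()
    cluster.close()
    assert(found, "expected row never appeared in Cassandra")
  }
}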


You need to run this on a machine that has Docker and Docker Compose
installed, typically a Ubuntu host. This machine can either be bare
metal or a full VM (Virtualbox, VMware, Xen), which is what you get if
you run in an IaaS cloud like GCE or EC2. Hence, your CI/CD Jenkins
machine should be a dedicated instance.

Developers with Macs would run docker-machine, which uses Virtualbox
IIRC. Developers with Linux machines can run Docker and Docker Compose
natively.

You can in theory run Jenkins in Docker and spin up new Docker
containers from inside Docker using some docker-inside-docker setup.
It will add complexity, however, and I suspect it will be brittle, so
I don't recommend it.

You could also in theory use some cloud container service that runs
your images during tests. They have different ways of wiring Docker
images together than Docker Compose does, however, so it also increases complexity
and makes the CI/CD setup different than the setup on local developer
machines. I went down this path once, but I cannot recommend it.


If you instead want a setup where the test harness and your Spark
Streaming application run outside Docker, you omit them from
docker-compose.yml, and have the test harness run docker-compose, and
figure out the ports and addresses to connect to. As mentioned
earlier, this requires more plumbing, but results in an integration
test setup that runs smoothly from Gradle/Maven/SBT and also from
IntelliJ.
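
A minimal sketch of that outer variant, assuming ScalaTest and a docker-compose binary on the PATH (readiness polling is left out):

import org.scalatest.{BeforeAndAfterAll, FunSuite}
import scala.sys.process._

class StreamingIntegrationSpec extends FunSuite with BeforeAndAfterAll {

  override def beforeAll(): Unit = {
    // Bring up Kafka/Cassandra (and optionally the job under test) before any test runs.
    require("docker-compose up -d".! == 0, "failed to start the docker-compose environment")
    // A real harness would also wait here until Kafka and Cassandra accept connections.
  }

  override def afterAll(): Unit = {
    "docker-compose down".!
    ()
  }

  test("streaming job writes the expected rows to Cassandra") {
    // Produce input to Kafka, then poll Cassandra for the expected output,
    // as in the producer/poll sketch earlier in the thread.
  }
}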

I hope things are clearer. Let me know if you have further questions.

Regards,



Lars Albertsson
Data engineering consultant
www.mapflat.com
+46 70 7687109
Calendar: https://goo.gl/6FBtlS



On Thu, Jul 7, 2016 at 3:14 AM, swetha kasireddy
 wrote:
> Can this Docker image be used to spin up a Kafka cluster in a CI/CD pipeline
> like Jenkins to run the integration tests? Or can it be done only on the
> local machine that has Docker installed? I assume that the box where the
> CI/CD pipeline runs should have Docker installed, correct?
>
> On Mon, Jul 4, 2016 at 5:20 AM, Lars Albertsson  wrote:
>>
>> I created such a setup for a client a few months ago. It is pretty
>> straightforward, but it can take some work to get all the wires
>> connected.
>>
>> I suggest that you start with the spotify/kafka
>> (https://github.com/spotify/docker-kafka) Docker image, since it
>> includes a bundled zookeeper. The alternative would be to spin up a
>> separate Zookeeper Docker container and connect them, but for testing
>> purposes, it would make the setup more complex.
>>
>> You'll need to inform Kafka about the external address it exposes by
>> setting ADVERTISED_HOST to the output of "docker-machine ip" (on Mac)
>> or the address printed by "ip addr show docker0" (Linux). I also
>> suggest setting
>> AUTO_CREATE_TOPICS to true.
>>
>> You can choose to run your Spark Streaming application under test
>> (SUT) and your test harness also in Docker containers, or directly on
>> your host.
>>
>> In the former case, it is easiest to set up a Docker Compose file
>> linking the harness and SUT to Kafka. This variant provides better
>> isolation, and might integrate better if you have existing similar
>> test frameworks.
>>
>> If you want to run the harness and SUT outside Docker, I suggest that
>> you build your harness with a standard test framework, e.g. scalatest
>> or JUnit, and run both harness and SUT in the same JVM. In this case,
>> you put code to bring up the Kafka Docker container in test framework
>> setup methods. This test strategy integrates better with IDEs and
>> build tools (mvn/sbt/gradle), since they will run (and debug) your
>> tests without any special integration. I therefore prefer this
>> strategy.
>>
>>
>> What is the output of your application? If it is messages on a
>> different Kafka topic, the test harness can merely subscribe and
>> verify output. If you emit output to a database, you'll need another
>> Docker container, integrated with Docker Compose. If you are