Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread janardhan shetty
Thanks Krishna for your response.
Features in the training set have more categories than in the test set, so
when VectorAssembler is used these numbers are usually different, and I
believe that is expected, right?

The assumption here is that the test dataset usually will not have as many
categories in its features as the training set does.
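
A minimal sketch of the usual way to keep the vector sizes consistent, assuming
Spark ML 2.0 and a single categorical column (the column and DataFrame names
below are assumptions): fit the StringIndexer, OneHotEncoder and VectorAssembler
on the training data only, then reuse the same fitted pipeline on the test data.

```
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// Hypothetical column names; replace with the real categorical features.
val indexer = new StringIndexer()
  .setInputCol("category").setOutputCol("categoryIndex")
  .setHandleInvalid("skip")          // assumption: drop test rows with labels unseen in training
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex").setOutputCol("categoryVec")
val assembler = new VectorAssembler()
  .setInputCols(Array("categoryVec")).setOutputCol("features")

// Fit on the training data only, then reuse the fitted model on the test data,
// so the encoded vector size is driven by the training categories alone.
val pipelineModel = new Pipeline()
  .setStages(Array(indexer, encoder, assembler))
  .fit(trainingDf)

val trainFeatures = pipelineModel.transform(trainingDf)
val testFeatures  = pipelineModel.transform(testDf)
```

If the indexer and encoder are instead fitted separately on the training and
test DataFrames, the one-hot vectors can end up with different lengths, which
is exactly the kind of BLAS.dot size mismatch reported in this thread.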

On Sun, Aug 21, 2016 at 4:44 PM, Krishna Sankar  wrote:

> Hi,
>Just after I sent the mail, I realized that the error might be with the
> training-dataset not the test-dataset.
>
>1. it might be that you are feeding the full Y vector for training.
>2. Which could mean, you are using ~50-50 training-test split.
>3. Take a good look at the code that does the data split and the
>datasets where they are allocated to.
>
> Cheers
> 
>
> On Sun, Aug 21, 2016 at 4:37 PM, Krishna Sankar 
> wrote:
>
>> Hi,
>>   Looks like the test-dataset has different sizes for X & Y. Possible
>> steps:
>>
>>1. What is the test-data-size ?
>>   - If it is 15,909, check the prediction variable vector - it is
>>   now 29,471, should be 15,909
>>   - If you expect it to be 29,471, then the X Matrix is not right.
>>   2. It is also probable that the size of the test-data is something
>>else. If so, check the data pipeline.
>>3. If you print the count() of the various vectors, I think you can
>>find the error.
>>
>> Cheers & Good Luck
>> 
>>
>> On Sun, Aug 21, 2016 at 3:16 PM, janardhan shetty > > wrote:
>>
>>> Hi,
>>>
>>> I have built the logistic regression model using training-dataset.
>>> When I am predicting on a test-dataset, it is throwing the below error
>>> of size mismatch.
>>>
>>> Steps done:
>>> 1. String indexers on categorical features.
>>> 2. One hot encoding on these indexed features.
>>>
>>> Any help is appreciated to resolve this issue or is it a bug ?
>>>
>>> SparkException: *Job aborted due to stage failure: Task 0 in stage
>>> 635.0 failed 1 times, most recent failure: Lost task 0.0 in stage 635.0
>>> (TID 19421, localhost): java.lang.IllegalArgumentException: requirement
>>> failed: BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching
>>> sizes: x.size = 15909, y.size = 29471* at 
>>> scala.Predef$.require(Predef.scala:224)
>>> at org.apache.spark.ml.linalg.BLAS$.dot(BLAS.scala:104) at
>>> org.apache.spark.ml.classification.LogisticRegressionModel$$
>>> anonfun$19.apply(LogisticRegression.scala:505) at org.apache.spark.ml
>>> .classification.LogisticRegressionModel$$anonfun$19.apply(LogisticRegression.scala:504)
>>> at org.apache.spark.ml.classification.LogisticRegressionModel.p
>>> redictRaw(LogisticRegression.scala:594) at org.apache.spark.ml
>>> .classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:484)
>>> at org.apache.spark.ml.classification.ProbabilisticClassificati
>>> onModel$$anonfun$1.apply(ProbabilisticClassifier.scala:112) at
>>> org.apache.spark.ml.classification.ProbabilisticClassificati
>>> onModel$$anonfun$1.apply(ProbabilisticClassifier.scala:111) at
>>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$Spe
>>> cificUnsafeProjection.evalExpr137$(Unknown Source) at
>>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$Spe
>>> cificUnsafeProjection.apply(Unknown Source) at
>>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$Spe
>>> cificUnsafeProjection.apply(Unknown Source) at
>>> scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>>>
>>
>>
>


Re: Entire XML data as one of the column in DataFrame

2016-08-21 Thread Hyukjin Kwon
I can't say this is the best way to do it, but my instant thought is as
below:


Create two DataFrames:

sc.hadoopConfiguration.set(XmlInputFormat.START_TAG_KEY, s"")
sc.hadoopConfiguration.set(XmlInputFormat.END_TAG_KEY, s"")
sc.hadoopConfiguration.set(XmlInputFormat.ENCODING_KEY, "UTF-8")
val strXmlDf = sc.newAPIHadoopFile(carsFile,
  classOf[XmlInputFormat],
  classOf[LongWritable],
  classOf[Text]).map { pair =>
new String(pair._2.getBytes, 0, pair._2.getLength)
  }.toDF("XML")

val xmlDf = sqlContext.read.format("xml")
  .option("rowTag", "emplist")
  .load(path)


Zip those two, maybe like in https://github.com/apache/spark/pull/7474,


and then start filtering on emp.id or emp.name. A rough sketch of the zip step
is below.
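
A minimal sketch of the zip step, assuming both DataFrames were produced from
the same file so their rows line up one-to-one and partition-for-partition
(RDD.zip requires that); the column name "XML" matches the first DataFrame above.

```
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Pair each parsed row with its raw XML string and append it as a new column.
val zippedRdd = xmlDf.rdd.zip(strXmlDf.rdd).map { case (parsed, raw) =>
  Row.fromSeq(parsed.toSeq :+ raw.getString(0))
}
val schema = StructType(xmlDf.schema.fields :+ StructField("XML", StringType))
val combinedDf = sqlContext.createDataFrame(zippedRdd, schema)
```

RDD.zip fails at runtime if the two sides differ in partitioning or element
count, so this only works when both DataFrames really come from the same
records in the same order.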



2016-08-22 5:31 GMT+09:00 :

> Hello Experts,
>
>
>
> I’m using spark-xml package which is automatically inferring my schema and
> creating a DataFrame.
>
>
>
> I’m extracting few fields like id, name (which are unique) from below xml,
> but my requirement is to store entire XML in one of the column as well. I’m
> writing this data to AVRO hive table. Can anyone tell me how to achieve
> this?
>
>
>
> Example XML and expected output is given below.
>
>
>
> Sample XML:
>
> 
>
> 
>
>
>
>1
>
>foo
>
> 
>
>   
>
> 1
>
> foo
>
>   
>
>   
>
> 1
>
> foo
>
>   
>
> 
>
>
>
> 
>
> 
>
>
>
> Expected output:
>
> id, name, XML
>
> 1, foo,  ….
>
>
>
> Thanks,
>
> Sreekanth Jella
>
>
>
>
>


Hi,

2016-08-21 Thread Xi Shen
I found there are several .conf files in the conf directory. Which one is
used as the default when I click the "new" button on the notebook homepage?
I want to edit the default profile configuration so that all my notebooks
are created with custom settings.

-- 


Thanks,
David S.


Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread Krishna Sankar
Hi,
   Just after I sent the mail, I realized that the error might be with the
training dataset, not the test dataset.

   1. It might be that you are feeding the full Y vector for training.
   2. That could mean you are using a ~50-50 training-test split.
   3. Take a good look at the code that does the data split and the
      datasets the splits are allocated to (see the sketch below).
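
A minimal sketch of an explicit split with size checks (the DataFrame and
column names are assumptions), which makes it easy to spot when X and Y come
from different splits:

```
import org.apache.spark.ml.classification.LogisticRegression

// Split once and derive everything (features and labels) from the same halves.
val Array(training, test) = assembledDf.randomSplit(Array(0.5, 0.5), 42L)
println(s"training rows: ${training.count()}")
println(s"test rows:     ${test.count()}")

val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
val model = lr.fit(training)                        // fit and score on matching splits only
val predictions = model.transform(test)
println(s"prediction rows: ${predictions.count()}") // should equal the test row count
```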

Cheers


On Sun, Aug 21, 2016 at 4:37 PM, Krishna Sankar  wrote:

> Hi,
>   Looks like the test-dataset has different sizes for X & Y. Possible
> steps:
>
>1. What is the test-data-size ?
>   - If it is 15,909, check the prediction variable vector - it is now
>   29,471, should be 15,909
>   - If you expect it to be 29,471, then the X Matrix is not right.
>   2. It is also probable that the size of the test-data is something
>else. If so, check the data pipeline.
>3. If you print the count() of the various vectors, I think you can
>find the error.
>
> Cheers & Good Luck
> 
>
> On Sun, Aug 21, 2016 at 3:16 PM, janardhan shetty 
> wrote:
>
>> Hi,
>>
>> I have built the logistic regression model using training-dataset.
>> When I am predicting on a test-dataset, it is throwing the below error of
>> size mismatch.
>>
>> Steps done:
>> 1. String indexers on categorical features.
>> 2. One hot encoding on these indexed features.
>>
>> Any help is appreciated to resolve this issue or is it a bug ?
>>
>> SparkException: *Job aborted due to stage failure: Task 0 in stage 635.0
>> failed 1 times, most recent failure: Lost task 0.0 in stage 635.0 (TID
>> 19421, localhost): java.lang.IllegalArgumentException: requirement failed:
>> BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes:
>> x.size = 15909, y.size = 29471* at scala.Predef$.require(Predef.scala:224)
>> at org.apache.spark.ml.linalg.BLAS$.dot(BLAS.scala:104) at
>> org.apache.spark.ml.classification.LogisticRegressionModel$$
>> anonfun$19.apply(LogisticRegression.scala:505) at org.apache.spark.ml
>> .classification.LogisticRegressionModel$$anonfun$19.apply(LogisticRegression.scala:504)
>> at org.apache.spark.ml.classification.LogisticRegressionModel.p
>> redictRaw(LogisticRegression.scala:594) at org.apache.spark.ml.classifica
>> tion.LogisticRegressionModel.predictRaw(LogisticRegression.scala:484) at
>> org.apache.spark.ml.classification.ProbabilisticClassificati
>> onModel$$anonfun$1.apply(ProbabilisticClassifier.scala:112) at
>> org.apache.spark.ml.classification.ProbabilisticClassificati
>> onModel$$anonfun$1.apply(ProbabilisticClassifier.scala:111) at
>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$Spe
>> cificUnsafeProjection.evalExpr137$(Unknown Source) at
>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$Spe
>> cificUnsafeProjection.apply(Unknown Source) at
>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$Spe
>> cificUnsafeProjection.apply(Unknown Source) at
>> scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>>
>
>


Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread Krishna Sankar
Hi,
  Looks like the test dataset has different sizes for X & Y. Possible steps:

   1. What is the test data size?
      - If it is 15,909, check the prediction variable vector: it is now
        29,471 and should be 15,909.
      - If you expect it to be 29,471, then the X matrix is not right.
   2. It is also possible that the size of the test data is something
      else. If so, check the data pipeline.
   3. If you print the count() of the various vectors, I think you can find
      the error (see the sketch below).
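
A minimal sketch of the counts and sizes to print (the DataFrame and model
variable names are assumptions); the feature-vector length has to match the
model's coefficient-vector length:

```
import org.apache.spark.ml.linalg.Vector

println(s"training rows: ${trainingFeatures.count()}")
println(s"test rows:     ${testFeatures.count()}")

// The coefficient vector length must match the feature vector length of every
// row being scored; a mismatch here is exactly what BLAS.dot complains about.
println(s"model coefficients size: ${lrModel.coefficients.size}")
val firstTestVector = testFeatures.select("features").head.getAs[Vector](0)
println(s"first test feature size: ${firstTestVector.size}")
```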

Cheers & Good Luck


On Sun, Aug 21, 2016 at 3:16 PM, janardhan shetty 
wrote:

> Hi,
>
> I have built the logistic regression model using training-dataset.
> When I am predicting on a test-dataset, it is throwing the below error of
> size mismatch.
>
> Steps done:
> 1. String indexers on categorical features.
> 2. One hot encoding on these indexed features.
>
> Any help is appreciated to resolve this issue or is it a bug ?
>
> SparkException: *Job aborted due to stage failure: Task 0 in stage 635.0
> failed 1 times, most recent failure: Lost task 0.0 in stage 635.0 (TID
> 19421, localhost): java.lang.IllegalArgumentException: requirement failed:
> BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes:
> x.size = 15909, y.size = 29471* at scala.Predef$.require(Predef.scala:224)
> at org.apache.spark.ml.linalg.BLAS$.dot(BLAS.scala:104) at
> org.apache.spark.ml.classification.LogisticRegressionModel$$
> anonfun$19.apply(LogisticRegression.scala:505) at org.apache.spark.ml.
> classification.LogisticRegressionModel$$anonfun$19.apply(LogisticRegression.scala:504)
> at org.apache.spark.ml.classification.LogisticRegressionModel.
> predictRaw(LogisticRegression.scala:594) at org.apache.spark.ml.
> classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:484)
> at org.apache.spark.ml.classification.ProbabilisticClassificationMod
> el$$anonfun$1.apply(ProbabilisticClassifier.scala:112) at
> org.apache.spark.ml.classification.ProbabilisticClassificationMod
> el$$anonfun$1.apply(ProbabilisticClassifier.scala:111) at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$
> SpecificUnsafeProjection.evalExpr137$(Unknown Source) at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$
> SpecificUnsafeProjection.apply(Unknown Source) at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$
> SpecificUnsafeProjection.apply(Unknown Source) at
> scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>


Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread janardhan shetty
Hi,

I have built a logistic regression model using the training dataset.
When I predict on a test dataset, it throws the size-mismatch error below.

Steps done:
1. String indexers on the categorical features.
2. One-hot encoding on these indexed features.

Any help resolving this issue is appreciated, or is it a bug?

SparkException: *Job aborted due to stage failure: Task 0 in stage 635.0
failed 1 times, most recent failure: Lost task 0.0 in stage 635.0 (TID
19421, localhost): java.lang.IllegalArgumentException: requirement failed:
BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes:
x.size = 15909, y.size = 29471* at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.linalg.BLAS$.dot(BLAS.scala:104) at
org.apache.spark.ml.classification.LogisticRegressionModel$$anonfun$19.apply(LogisticRegression.scala:505)
at
org.apache.spark.ml.classification.LogisticRegressionModel$$anonfun$19.apply(LogisticRegression.scala:504)
at
org.apache.spark.ml.classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:594)
at
org.apache.spark.ml.classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:484)
at
org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$1.apply(ProbabilisticClassifier.scala:112)
at
org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$1.apply(ProbabilisticClassifier.scala:111)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalExpr137$(Unknown
Source) at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
Source) at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
Source) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)


Re: submitting spark job with kerberized Hadoop issue

2016-08-21 Thread Aneela Saleem
Any update on this?

On Tuesday, 16 August 2016, Aneela Saleem  wrote:

> Thanks Steve,
>
> I went through this but still not able to fix the issue
>
> On Mon, Aug 15, 2016 at 2:01 AM, Steve Loughran  > wrote:
>
>> Hi,
>>
>> Just came across this while going through all emails I'd left unread over
>> my vacation.
>>
>> did you manage to fix this?
>>
>> 1. There's some notes I've taken on this topic:
>> https://www.gitbook.com/book/steveloughran/kerberos_and_hadoop/details
>>
>>  -look at "Error messages to fear" to see if this one has surfaced;
>> otherwise look at "low level secrets" to see how to start debugging things
>>
>>
>> On 5 Aug 2016, at 14:54, Aneela Saleem > > wrote:
>>
>> Hi all,
>>
>> I'm trying to connect to Kerberized Hadoop cluster using spark job. I
>> have kinit'd from command line. When i run the following job i.e.,
>>
>> *./bin/spark-submit --keytab /etc/hadoop/conf/spark.keytab --principal
>> spark/hadoop-master@platalyticsrealm --class
>> com.platalytics.example.spark.App --master spark://hadoop-master:7077
>> /home/vm6/project-1-jar-with-dependencies.jar
>> hdfs://hadoop-master:8020/text*
>>
>> I get the error:
>>
>> Caused by: java.io.IOException: 
>> org.apache.hadoop.security.AccessControlException:
>> Client cannot authenticate via:[TOKEN, KERBEROS]
>> at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:680)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:415)
>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGro
>> upInformation.java:1628)
>>
>> Following are the contents of *spark-defaults.conf* file:
>>
>> spark.master spark://hadoop-master:7077
>> spark.eventLog.enabled   true
>> spark.eventLog.dir   hdfs://hadoop-master:8020/spark/logs
>> spark.serializer org.apache.spark.serializer.Kr
>> yoSerializer
>> spark.yarn.access.namenodes hdfs://hadoop-master:8020/
>> spark.yarn.security.tokens.hbase.enabled true
>> spark.yarn.security.tokens.hive.enabled true
>> spark.yarn.principal yarn/hadoop-master@platalyticsrealm
>> spark.yarn.keytab /etc/hadoop/conf/yarn.keytab
>>
>>
>> Also i have added following in *spark-env.sh* file:
>>
>> HOSTNAME=`hostname -f`
>> export SPARK_HISTORY_OPTS="-Dspark.history.kerberos.enabled=true
>> -Dspark.history.kerberos.principal=spark/${HOSTNAME}@platalyticsrealm
>> -Dspark.history.kerberos.keytab=/etc/hadoop/conf/spark.keytab"
>>
>>
>> Please guide me, how to trace the issue?
>>
>> Thanks
>>
>>
>>
>


Re: Accessing HBase through Spark with Security enabled

2016-08-21 Thread Aneela Saleem
Any update on this?

On Tuesday, 16 August 2016, Aneela Saleem  wrote:

> Thanks Steve,
>
> I have gone through it's documentation, i did not get any idea how to
> install it. Can you help me?
>
> On Mon, Aug 15, 2016 at 4:23 PM, Steve Loughran  > wrote:
>
>>
>> On 15 Aug 2016, at 08:29, Aneela Saleem > > wrote:
>>
>> Thanks Jacek!
>>
>> I have already set hbase.security.authentication property set to
>> kerberos, since Hbase with kerberos is working fine.
>>
>> I tested again after correcting the typo but got same error. Following is
>> the code, Please have a look:
>>
>> System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");
>> System.setProperty("java.security.auth.login.config",
>> "/etc/hbase/conf/zk-jaas.conf");
>> val hconf = HBaseConfiguration.create()
>> val tableName = "emp"
>> hconf.set("hbase.zookeeper.quorum", "hadoop-master")
>> hconf.set(TableInputFormat.INPUT_TABLE, tableName)
>> hconf.set("hbase.zookeeper.property.clientPort", "2181")
>> hconf.set("hbase.master", "hadoop-master:6")
>> hconf.set("hadoop.security.authentication", "kerberos")
>> hconf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
>> hconf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
>>
>>
>> spark should be automatically picking those up from the classpath; adding
>> them to your  own hconf isn't going to have any effect on the hbase config
>> used to extract the hbase token on Yarn app launch. That all needs to be
>> set up at the time the Spark cluster/app is launched. If you are running
>>
>> There's a little diagnostics tool, kdiag, which will be in future Hadoop
>> versions —It's available as a standalone JAR for others to use
>>
>> https://github.com/steveloughran/kdiag
>>
>> This may help verify things like your keytab/login details
>>
>>
>> val conf = new SparkConf()
>> conf.set("spark.yarn.security.tokens.hbase.enabled", "true")
>> conf.set("spark.authenticate", "true")
>> conf.set("spark.authenticate.secret","None")
>> val sc = new SparkContext(conf)
>> val hBaseRDD = sc.newAPIHadoopRDD(hconf, classOf[TableInputFormat],
>> classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
>> classOf[org.apache.hadoop.hbase.client.Result])
>>
>> val count = hBaseRDD.count()
>> print("HBase RDD count:" + count)
>>
>>
>>
>>
>> On Sat, Aug 13, 2016 at 8:36 PM, Jacek Laskowski > > wrote:
>>
>>> Hi Aneela,
>>>
>>> My (little to no) understanding of how to make it work is to use
>>> hbase.security.authentication property set to kerberos (see [1]).
>>>
>>>
>> Nobody understands kerberos; you are not alone. And the more you
>> understand of Kerberos, the less you want to.
>>
>> Spark on YARN uses it to get the tokens for Hive, HBase et al (see
>>> [2]). It happens when Client starts conversation to YARN RM (see [3]).
>>>
>>> You should not do that yourself (and BTW you've got a typo in
>>> spark.yarn.security.tokens.habse.enabled setting). I think that the
>>> entire code you pasted matches the code Spark's doing itself before
>>> requesting resources from YARN.
>>>
>>> Give it a shot and report back since I've never worked in such a
>>> configuration and would love improving in this (security) area.
>>> Thanks!
>>>
>>> [1] http://www.cloudera.com/documentation/enterprise/5-5-x/topic
>>> s/cdh_sg_hbase_authentication.html#concept_zyz_vg5_nt__secti
>>> on_s1l_nwv_ls
>>> [2] https://github.com/apache/spark/blob/master/yarn/src/main/sc
>>> ala/org/apache/spark/deploy/yarn/security/HBaseCredentialPro
>>> vider.scala#L58
>>> [3] https://github.com/apache/spark/blob/master/yarn/src/main/sc
>>> ala/org/apache/spark/deploy/yarn/Client.scala#L396
>>>
>>>
>>
>> [2] is the code from last week; SPARK-14743. The predecessor code was
>> pretty similar though: make an RPC call to HBase to ask for an HBase
>> delegation token to be handed off to the YARN app; it requires the use to
>> be Kerberos authenticated first.
>>
>>
>> Pozdrawiam,
>>> Jacek Laskowski
>>>
>>> >> > 2016-08-07 20:43:57,617 WARN
>>> >> > [hconnection-0x24b5fa45-metaLookup-shared--pool2-t1]
>>> ipc.RpcClientImpl:
>>> >> > Exception encountered while connecting to the server :
>>> >> > javax.security.sasl.SaslException: GSS initiate failed [Caused by
>>> >> > GSSException: No valid credentials provided (Mechanism level:
>>> Failed to
>>> >> > find
>>> >> > any Kerberos tgt)]
>>> >> > 2016-08-07 20:43:57,619 ERROR
>>> >> > [hconnection-0x24b5fa45-metaLookup-shared--pool2-t1]
>>> ipc.RpcClientImpl:
>>> >> > SASL
>>> >> > authentication failed. The most likely cause is missing or invalid
>>> >> > credentials. Consider 'kinit'.
>>> >> > javax.security.sasl.SaslException: GSS initiate failed [Caused by
>>> >> > GSSException: No valid credentials provided (Mechanism level:
>>> Failed to
>>> >> > find

RE: Flattening XML in a DataFrame

2016-08-21 Thread srikanth.jella
Hi Hyukjin,

I have created the below issue.
https://github.com/databricks/spark-xml/issues/155


Sent from Mail for Windows 10

From: Hyukjin Kwon

Entire XML data as one of the column in DataFrame

2016-08-21 Thread srikanth.jella
Hello Experts,

I’m using the spark-xml package, which is automatically inferring my schema and
creating a DataFrame.

I’m extracting a few fields like id and name (which are unique) from the XML
below, but my requirement is to store the entire XML in one of the columns as
well. I’m writing this data to an Avro Hive table. Can anyone tell me how to
achieve this?

Example XML and expected output is given below.

Sample XML:


   
   1
   foo
    
  
    1
    foo
  
  
    1
    foo
  
    
   


 
Expected output:
id, name, XML
1, foo,  ….
 
Thanks,
Sreekanth Jella
 



Re: Spark Streaming application failing with Token issue

2016-08-21 Thread Mich Talebzadeh
Hi Kamesh,

The message you are getting after 7 days:

PriviledgedActionException as:sys_bio_replicator (auth:KERBEROS)
cause:org.apache.hadoop.ipc.RemoteException(org.apache.
hadoop.security.token.SecretManager$InvalidToken): Token has expired

This sounds like an IPC issue with a Kerberos authentication timeout. I suggest
you talk to your Linux admin and also check this link:

https://support.f5.com/kb/en-us/products/big-ip_apm/manuals/product/apm-authentication-single-sign-on-11-5-0/9.print.html


HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 21 August 2016 at 16:43, Jacek Laskowski  wrote:

> Hi Kamesh,
>
> I believe your only option is to re-start your application every 7
> days (perhaps you need to enable checkpointing). See
> https://github.com/apache/spark/commit/ab648c0004cfb20d53554ab333dd2d
> 198cb94ffa
> for a change with automatic security token renewal.
>
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Thu, Aug 18, 2016 at 11:51 AM, Kamesh  wrote:
> > Hi all,
> >
> >  I am running a spark streaming application that store events into
> > Secure(Kerborized) HBase cluster. I launched this spark streaming
> > application by passing --principal and --keytab. Despite this, spark
> > streaming application is failing after 7days with Token issue. Can
> someone
> > please suggest how to fix this.
> >
> > Error Message
> >
> > 16/08/18 02:39:45 WARN ipc.AbstractRpcClient: Exception encountered while
> > connecting to the server :
> > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.
> SecretManager$InvalidToken):
> > Token has expired
> >
> > 16/08/18 02:39:45 WARN security.UserGroupInformation:
> > PriviledgedActionException as:sys_bio_replicator (auth:KERBEROS)
> > cause:org.apache.hadoop.ipc.RemoteException(org.apache.
> hadoop.security.token.SecretManager$InvalidToken):
> > Token has expired
> >
> > Environment
> >
> > Spark Version : 1.6.1
> >
> > HBase version : 1.0.0
> >
> > Hadoop Version : 2.6.0
> >
> >
> > Thanks & Regards
> > Kamesh.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: How to continuous update or refresh RandomForestClassificationModel

2016-08-21 Thread Jacek Laskowski
Hi,

That's my understanding -- you need to fit another model given the
training data.
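
A minimal sketch of that refit loop (the dataset names are assumptions): append
the mis-predicted examples to the training data and fit a fresh model, since a
fitted RandomForestClassificationModel is immutable and is refreshed by refitting.

```
import org.apache.spark.ml.classification.RandomForestClassifier

// Append the failed rows to the original training data (both are assumed
// to share the same "label"/"features" schema) and fit a brand-new model.
val updatedTraining = training.union(failedExamples)

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

val refreshedModel = rf.fit(updatedTraining)   // the previously fitted model is left unchanged
```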

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Fri, Aug 19, 2016 at 10:21 AM, 陈哲  wrote:
> Hi All
>I'm using my training data generate the
> RandomForestClassificationModel , and I can use this to predict the upcoming
> data.
>But if predict failed I'll put the failed features into the training
> data, here is my question , how can I update or refresh the model ? Which
> api should I use ? Do I need to generate this model from start again?
>
> Thanks

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-21 Thread Everett Anderson
On Sun, Aug 21, 2016 at 3:08 AM, Bedrytski Aliaksandr 
wrote:

> Hi,
>
> we share the same spark/hive context between tests (executed in
> parallel), so the main problem is that the temporary tables are
> overwritten each time they are created, this may create race conditions
> as these tempTables may be seen as global mutable shared state.
>
> So each time we create a temporary table, we add an unique, incremented,
> thread safe id (AtomicInteger) to its name so that there are only
> specific, non-shared temporary tables used for a test.
>

Makes sense.

But when you say you're sharing the same Spark/Hive context between tests,
I'm assuming that's between tests within the same test class, and that
you're not sharing across test classes (which a build tool like Maven or
Gradle might execute in separate JVMs).

Is that right?




>
> --
>   Bedrytski Aliaksandr
>   sp...@bedryt.ski
>
>
>
> On Sat, Aug 20, 2016, at 01:25, Everett Anderson wrote:
> Hi!
>
> Just following up on this --
>
> When people talk about a shared session/context for testing like this,
> I assume it's still within one test class. So it's still the case that
> if you have a lot of test classes that test Spark-related things, you
> must configure your build system to not run in them in parallel.
> You'll get the benefit of not creating and tearing down a Spark
> session/context between test cases with a test class, though.
>
> Is that right?
>
> Or have people figured out a way to have sbt (or Maven/Gradle/etc)
> share Spark sessions/contexts across integration tests in a safe way?
>
>
> On Mon, Aug 1, 2016 at 3:23 PM, Holden Karau
>  wrote:
> Thats a good point - there is an open issue for spark-testing-base to
> support this shared sparksession approach - but I haven't had the
> time ( https://github.com/holdenk/spark-testing-base/issues/123 ).
> I'll try and include this in the next release :)
>
> On Mon, Aug 1, 2016 at 9:22 AM, Koert Kuipers
>  wrote:
> we share a single single sparksession across tests, and they can run
> in parallel. is pretty fast
>
> On Mon, Aug 1, 2016 at 12:02 PM, Everett Anderson
>  wrote:
> Hi,
>
> Right now, if any code uses DataFrame/Dataset, I need a test setup
> that brings up a local master as in this article[1].
>
> That's a lot of overhead for unit testing and the tests can't run
> in parallel, so testing is slow -- this is more like what I'd call
> an integration test.
>
> Do people have any tricks to get around this? Maybe using spy mocks
> on fake DataFrame/Datasets?
>
> Anyone know if there are plans to make more traditional unit
> testing possible with Spark SQL, perhaps with a stripped down in-
> memory implementation? (I admit this does seem quite hard since
> there's so much functionality in these classes!)
>
> Thanks!
>
>
> - Everett
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
>


Re: Reporting errors from spark sql

2016-08-21 Thread Jacek Laskowski
Hi,

See 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParseDriver.scala#L65
to learn how Spark SQL parses SQL texts. It could give you a way out.
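
A minimal sketch of pulling the position out programmatically, assuming the
parser raises org.apache.spark.sql.catalyst.parser.ParseException carrying
start/stop Origin fields as in the ParseDriver.scala linked above, and that
spark is an existing SparkSession (exact fields may differ by Spark version):

```
import org.apache.spark.sql.catalyst.parser.ParseException

try {
  spark.sql("SELCT id FROM t")   // deliberately invalid SQL
} catch {
  case e: ParseException =>
    // Origin carries Option[Int] values for the position of the first error.
    val line   = e.start.line.getOrElse(-1)
    val column = e.start.startPosition.getOrElse(-1)
    println(s"Parse error at line $line, column $column: ${e.getMessage}")
}
```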

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Thu, Aug 18, 2016 at 3:14 PM, yael aharon  wrote:
> Hello,
> I am working on an SQL editor which is powered by spark SQL. When the SQL is
> not valid, I would like to provide the user with a line number and column
> number where the first error occurred. I am having a hard time finding a
> mechanism that will give me that information programmatically.
>
> Most of the time, if an erroneous SQL statement is used, I am getting a
> RuntimeException, where line number and column number are implicitly
> embedded within the text of the message, but it is really error prone to
> parse the message text and count the number of spaces prior to the '^'
> symbol...
>
> Sometimes, AnalysisException is used, but when I try to extract the line and
> startPosition from it, they are always empty.
>
> Any help would be greatly appreciated.
> thanks!

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Unsubscribe

2016-08-21 Thread Rahul Palamuttam
Hi sudhanshu,

Try user-unsubscribe.spark.apache.org

- Rahul P



Sent from my iPhone
> On Aug 21, 2016, at 9:19 AM, Sudhanshu Janghel 
>  wrote:
> 
> Hello,
> 
> I wish to unsubscribe from the channel.
> 
> KIND REGARDS, 
> SUDHANSHU


Unsubscribe

2016-08-21 Thread Sudhanshu Janghel
Hello,

I wish to unsubscribe from the channel.

KIND REGARDS,
SUDHANSHU


Re: Spark Streaming application failing with Token issue

2016-08-21 Thread Jacek Laskowski
Hi Kamesh,

I believe your only option is to re-start your application every 7
days (perhaps you need to enable checkpointing). See
https://github.com/apache/spark/commit/ab648c0004cfb20d53554ab333dd2d198cb94ffa
for a change with automatic security token renewal.
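
A minimal sketch of the checkpoint-based restart pattern (the directory, app
name and batch interval are assumptions), so a periodic restart, e.g. every 7
days as suggested above, can resume from where the previous run left off:

```
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///user/spark/checkpoints/streaming-app"   // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("streaming-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... define the input DStreams and output operations here ...
  ssc
}

// Recover from the checkpoint if one exists, otherwise build a fresh context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```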


Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Thu, Aug 18, 2016 at 11:51 AM, Kamesh  wrote:
> Hi all,
>
>  I am running a spark streaming application that store events into
> Secure(Kerborized) HBase cluster. I launched this spark streaming
> application by passing --principal and --keytab. Despite this, spark
> streaming application is failing after 7days with Token issue. Can someone
> please suggest how to fix this.
>
> Error Message
>
> 16/08/18 02:39:45 WARN ipc.AbstractRpcClient: Exception encountered while
> connecting to the server :
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> Token has expired
>
> 16/08/18 02:39:45 WARN security.UserGroupInformation:
> PriviledgedActionException as:sys_bio_replicator (auth:KERBEROS)
> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> Token has expired
>
> Environment
>
> Spark Version : 1.6.1
>
> HBase version : 1.0.0
>
> Hadoop Version : 2.6.0
>
>
> Thanks & Regards
> Kamesh.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Best way to read XML data from RDD

2016-08-21 Thread Darin McBeath
Another option would be to look at spark-xml-utils.  We use this extensively in 
the manipulation of our XML content.

https://github.com/elsevierlabs-os/spark-xml-utils



There are quite a few examples. Depending on your preference (and what you
want to do), you could use XPath, XQuery, or XSLT to transform, extract, or
filter.

As mentioned below, you want to initialize the parser in a mapPartitions call
(one of the examples shows this).

Hope this is helpful.

Darin.






From: Hyukjin Kwon 
To: Jörn Franke  
Cc: Diwakar Dhanuskodi ; Felix Cheung 
; user 
Sent: Sunday, August 21, 2016 6:10 AM
Subject: Re: Best way to read XML data from RDD



Hi Diwakar,

Spark XML library can take RDD as source.

```
val df = new XmlReader()
  .withRowTag("book")
  .xmlRdd(sqlContext, rdd)
```

If performance is critical, I would also recommend taking care of the creation
and destruction of the parser.

If the parser is not serializable, then you can do the creation for each
partition within mapPartitions, just like

https://github.com/apache/spark/blob/ac84fb64dd85257da06f93a48fed9bb188140423/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L322-L325


I hope this is helpful.




2016-08-20 15:10 GMT+09:00 Jörn Franke :

I fear the issue is that this will create and destroy a XML parser object 2 mio 
times, which is very inefficient - it does not really look like a parser 
performance issue. Can't you do something about the format choice? Ask your 
supplier to deliver another format (ideally avro or sth like this?)?
>Otherwise you could just create one XML Parser object / node, but sharing this 
>among the parallel tasks on the same node is tricky.
>The other possibility could be simply more hardware ...
>
>On 20 Aug 2016, at 06:41, Diwakar Dhanuskodi  
>wrote:
>
>
>Yes . It accepts a xml file as source but not RDD. The XML data embedded  
>inside json is streamed from kafka cluster.  So I could get it as RDD. 
>>Right  now  I am using  spark.xml  XML.loadstring method inside  RDD map 
>>function  but  performance  wise I am not happy as it takes 4 minutes to 
>>parse XML from 2 million messages in a 3 nodes 100G 4 cpu each environment. 
>>
>>
>>
>>
>>Sent from Samsung Mobile.
>>
>>
>> Original message 
>>From: Felix Cheung  
>>Date:20/08/2016  09:49  (GMT+05:30) 
>>To: Diwakar Dhanuskodi  , user 
>> 
>>Cc: 
>>Subject: Re: Best way to read XML data from RDD 
>>
>>
>>Have you tried
>>
>>https://github.com/databricks/spark-xml
>>?
>>
>>
>>
>>
>>
>>On Fri, Aug 19, 2016 at 1:07 PM -0700, "Diwakar Dhanuskodi" 
>> wrote:
>>
>>
>>Hi,  
>>
>>
>>There is a RDD with json data. I could read json data using rdd.read.json . 
>>The json data has XML data in couple of key-value paris. 
>>
>>
>>Which is the best method to read and parse XML from rdd. Is there any 
>>specific xml libraries for spark. Could anyone help on this.
>>
>>
>>Thanks.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Dataframe corrupted when sqlContext.read.json on a Gzipped file that contains more than one file

2016-08-21 Thread Sean Owen
You are attempting to read a tar file. That won't work. A compressed JSON
file would.
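
A minimal sketch of the working alternative (the path is an assumption): gzip
each JSON file on its own (for example `gzip a.json`, producing `a.json.gz`)
instead of tarring them together, and read the resulting files directly:

```
// Spark reads individually gzipped JSON files transparently; a .tgz/.tar.gz
// archive does not work.
val datasets = sqlContext.read.json("path/to/json-dir/*.json.gz")
datasets.show(1, false)
```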

On Sun, Aug 21, 2016, 12:52 Chua Jie Sheng  wrote:

> Hi Spark user list!
>
> I have been encountering corrupted records when reading Gzipped files that
> contains more than one file.
>
> Example:
> I have two .json file, [a.json, b.json]
> Each have multiple records (one line, one record).
>
> I tar both of them together on
>
> Mac OS X, 10.11.6
> bsdtar 2.8.3 - libarchive 2.8.3
>
> i.e. tar -czf a.tgz *.json
>
>
> When I attempt to read them (via Python):
>
> filename = "a.tgz"
> sqlContext = SQLContext(sc)
> datasets = sqlContext.read.json(filename)
>
> datasets.show(1, truncate=False)
>
>
> My first record will always be corrupted, showing up in _corrupt_record.
>
> Does anyone have any idea if it is feature or a defect?
>
> Best Regards
> Jie Sheng
>
> Important: This email is confidential and may be privileged. If you are
> not the intended recipient, please delete it and notify us immediately; you
> should not copy or use it for any purpose, nor disclose its contents to any
> other person. Thank you.
>


Dataframe corrupted when sqlContext.read.json on a Gzipped file that contains more than one file

2016-08-21 Thread Chua Jie Sheng
Hi Spark user list!

I have been encountering corrupted records when reading Gzipped files that
contains more than one file.

Example:
I have two .json file, [a.json, b.json]
Each have multiple records (one line, one record).

I tarred both of them together on

Mac OS X, 10.11.6
bsdtar 2.8.3 - libarchive 2.8.3

i.e. tar -czf a.tgz *.json


When I attempt to read them (via Python):

filename = "a.tgz"
sqlContext = SQLContext(sc)
datasets = sqlContext.read.json(filename)

datasets.show(1, truncate=False)


My first record will always be corrupted, showing up in _corrupt_record.

Does anyone have any idea whether this is a feature or a defect?

Best Regards
Jie Sheng

Important: This email is confidential and may be privileged. If you are not
the intended recipient, please delete it and notify us immediately; you
should not copy or use it for any purpose, nor disclose its contents to any
other person. Thank you.


Re: Best way to read XML data from RDD

2016-08-21 Thread Hyukjin Kwon
Hi Diwakar,

Spark XML library can take RDD as source.

```
val df = new XmlReader()
  .withRowTag("book")
  .xmlRdd(sqlContext, rdd)
```

If performance is critical, I would also recommend taking care of the creation
and destruction of the parser.

If the parser is not serializable, then you can do the creation for each
partition within mapPartitions, just like (a rough sketch follows the link below)

https://github.com/apache/spark/blob/ac84fb64dd85257da06f93a48fed9bb188140423/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L322-L325
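
A rough sketch of that per-partition pattern, using a plain JAXP DocumentBuilder
as a stand-in for whatever non-serializable parser is in play; xmlStringsRdd is
assumed to be an RDD[String] of XML documents.

```
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory

val parsedRdd = xmlStringsRdd.mapPartitions { iter =>
  // Built once per partition; DocumentBuilder is neither serializable nor thread-safe.
  val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
  iter.map { xml =>
    val doc = builder.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))
    doc.getDocumentElement.getTagName   // extract whatever is needed from the parsed document
  }
}
```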

I hope this is helpful.



2016-08-20 15:10 GMT+09:00 Jörn Franke :

> I fear the issue is that this will create and destroy a XML parser object
> 2 mio times, which is very inefficient - it does not really look like a
> parser performance issue. Can't you do something about the format choice?
> Ask your supplier to deliver another format (ideally avro or sth like
> this?)?
> Otherwise you could just create one XML Parser object / node, but sharing
> this among the parallel tasks on the same node is tricky.
> The other possibility could be simply more hardware ...
>
> On 20 Aug 2016, at 06:41, Diwakar Dhanuskodi 
> wrote:
>
> Yes . It accepts a xml file as source but not RDD. The XML data embedded
>  inside json is streamed from kafka cluster.  So I could get it as RDD.
> Right  now  I am using  spark.xml  XML.loadstring method inside  RDD map
> function  but  performance  wise I am not happy as it takes 4 minutes to
> parse XML from 2 million messages in a 3 nodes 100G 4 cpu each environment.
>
>
> Sent from Samsung Mobile.
>
>
>  Original message 
> From: Felix Cheung 
> Date:20/08/2016 09:49 (GMT+05:30)
> To: Diwakar Dhanuskodi , user <
> user@spark.apache.org>
> Cc:
> Subject: Re: Best way to read XML data from RDD
>
> Have you tried
>
> https://github.com/databricks/spark-xml
> ?
>
>
>
>
> On Fri, Aug 19, 2016 at 1:07 PM -0700, "Diwakar Dhanuskodi" <
> diwakar.dhanusk...@gmail.com> wrote:
>
> Hi,
>
> There is a RDD with json data. I could read json data using rdd.read.json
> . The json data has XML data in couple of key-value paris.
>
> Which is the best method to read and parse XML from rdd. Is there any
> specific xml libraries for spark. Could anyone help on this.
>
> Thanks.
>
>


Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-21 Thread Bedrytski Aliaksandr
Hi,

we share the same Spark/Hive context between tests (executed in
parallel), so the main problem is that the temporary tables are
overwritten each time they are created; this may create race conditions,
as these temp tables may be seen as global mutable shared state.

So each time we create a temporary table, we add a unique, incremented,
thread-safe id (AtomicInteger) to its name, so that only specific,
non-shared temporary tables are used for a test (a small sketch of the
pattern is below).
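
A minimal sketch of that naming pattern (the object, table, DataFrame and
session names are assumptions); the counter guarantees each test registers its
own view:

```
import java.util.concurrent.atomic.AtomicInteger

object TempTables {
  private val counter = new AtomicInteger(0)
  def unique(prefix: String): String = s"${prefix}_${counter.incrementAndGet()}"
}

// In a test: register the DataFrame under a name no other parallel test can collide with.
val tableName = TempTables.unique("events")
eventsDf.createOrReplaceTempView(tableName)        // Spark 2.0 API; registerTempTable in 1.x
val result = spark.sql(s"SELECT COUNT(*) FROM $tableName")
```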

--
  Bedrytski Aliaksandr
  sp...@bedryt.ski



> On Sat, Aug 20, 2016, at 01:25, Everett Anderson wrote:
> Hi!
>
> Just following up on this --
>
> When people talk about a shared session/context for testing like this,
> I assume it's still within one test class. So it's still the case that
> if you have a lot of test classes that test Spark-related things, you
> must configure your build system to not run in them in parallel.
> You'll get the benefit of not creating and tearing down a Spark
> session/context between test cases with a test class, though.
>
> Is that right?
>
> Or have people figured out a way to have sbt (or Maven/Gradle/etc)
> share Spark sessions/contexts across integration tests in a safe way?
>
>
> On Mon, Aug 1, 2016 at 3:23 PM, Holden Karau
>  wrote:
> Thats a good point - there is an open issue for spark-testing-base to
> support this shared sparksession approach - but I haven't had the
> time ( https://github.com/holdenk/spark-testing-base/issues/123 ).
> I'll try and include this in the next release :)
>
> On Mon, Aug 1, 2016 at 9:22 AM, Koert Kuipers
>  wrote:
> we share a single single sparksession across tests, and they can run
> in parallel. is pretty fast
>
> On Mon, Aug 1, 2016 at 12:02 PM, Everett Anderson
>  wrote:
> Hi,
>
> Right now, if any code uses DataFrame/Dataset, I need a test setup
> that brings up a local master as in this article[1].
>
> That's a lot of overhead for unit testing and the tests can't run
> in parallel, so testing is slow -- this is more like what I'd call
> an integration test.
>
> Do people have any tricks to get around this? Maybe using spy mocks
> on fake DataFrame/Datasets?
>
> Anyone know if there are plans to make more traditional unit
> testing possible with Spark SQL, perhaps with a stripped down in-
> memory implementation? (I admit this does seem quite hard since
> there's so much functionality in these classes!)
>
> Thanks!
>
>
> - Everett
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau


DCOS - s3

2016-08-21 Thread Martin Somers
I'm having trouble loading data from an S3 bucket.
Currently DC/OS is running Spark 2, so I'm not sure whether the code needs
modification with the upgrade.

My code at the moment looks like this:


sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xxx")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "xxx")

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val fname = "s3n://somespark/datain.csv"
  // val rows = sc.textFile(fname).map { line =>
  // val values = line.split(',').map(_.toDouble)
  // Vectors.dense(values)
  // }


  val rows = sc.textFile(fname)
  rows.count()



The Spark service returns a failed message, but little information on exactly
why the job didn't run.


Any suggestions on what I can try?
-- 
M