Re: HBase / Spark Kerberos problem

2016-05-19 Thread John Trengrove
Have you had a look at this issue?

https://issues.apache.org/jira/browse/SPARK-12279

There is a comment by Y Bodnar on how they successfully got Kerberos and
HBase working.
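
In case it is useful, here is a minimal sketch (my own, not a verbatim copy of that
comment) of the token-based approach discussed there: obtain an HBase delegation
token on the driver before creating the HBaseContext, so that executors launched
through YARN can authenticate to the region servers. It assumes hbase-site.xml
(with hbase.security.authentication=kerberos) is on the driver classpath and that
the submitting user has a valid Kerberos ticket; TokenUtil.obtainToken(Configuration)
is deprecated in newer HBase releases but is present in the HBase shipped with CDH 5.5.

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.security.token.TokenUtil
    import org.apache.hadoop.security.UserGroupInformation

    // Pick up hbase-site.xml from the driver classpath.
    val hbaseConf = HBaseConfiguration.create()

    // Obtain an HBase delegation token and attach it to the current user's
    // credentials so the executors can reuse it instead of needing a TGT.
    val token = TokenUtil.obtainToken(hbaseConf)
    UserGroupInformation.getCurrentUser.addToken(token)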

2016-05-18 18:13 GMT+10:00 :

> Hi all,
>
> I have been puzzling over a Kerberos problem for a while now and wondered
> if anyone can help.
>
> For spark-submit, I specify --keytab x --principal y, which creates my
> SparkContext fine.
> Connections to the ZooKeeper quorum to find the HBase master work well too.
> But when it comes to a .count() action on the RDD, I am always presented
> with the stack trace at the end of this mail.
>
> We are using CDH5.5.2 (spark 1.5.0), and
> com.cloudera.spark.hbase.HBaseContext is a wrapper around
> TableInputFormat/hadoopRDD (see
> https://github.com/cloudera-labs/SparkOnHBase), as you can see in the
> stack trace.
>
> Am I doing something obviously wrong here?
> A similar flow works well inside test code; only going via spark-submit
> exposes this issue.
>
> Code snippet (I have tried using the commented-out lines in various
> combinations, without success):
>
> val conf = new SparkConf().
>   set("spark.shuffle.consolidateFiles", "true").
>   set("spark.kryo.registrationRequired", "false").
>   set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
>   set("spark.kryoserializer.buffer", "30m")
> val sc = new SparkContext(conf)
> val cfg = sc.hadoopConfiguration
> // cfg.addResource(new org.apache.hadoop.fs.Path("/etc/hbase/conf/hbase-site.xml"))
> // UserGroupInformation.getCurrentUser.setAuthenticationMethod(UserGroupInformation.AuthenticationMethod.KERBEROS)
> // cfg.set("hbase.security.authentication", "kerberos")
> val hc = new HBaseContext(sc, cfg)
> val scan = new Scan
> scan.setTimeRange(startMillis, endMillis)
> val matchesInRange = hc.hbaseRDD(MY_TABLE, scan, resultToMatch)
> val cnt = matchesInRange.count()
> log.info(s"matches in range $cnt")
>
> Stack trace / log:
>
> 16/05/17 17:04:47 INFO SparkContext: Starting job: count at
> Analysis.scala:93
> 16/05/17 17:04:47 INFO DAGScheduler: Got job 0 (count at
> Analysis.scala:93) with 1 output partitions
> 16/05/17 17:04:47 INFO DAGScheduler: Final stage: ResultStage 0(count at
> Analysis.scala:93)
> 16/05/17 17:04:47 INFO DAGScheduler: Parents of final stage: List()
> 16/05/17 17:04:47 INFO DAGScheduler: Missing parents: List()
> 16/05/17 17:04:47 INFO DAGScheduler: Submitting ResultStage 0
> (MapPartitionsRDD[1] at map at HBaseContext.scala:580), which has no
> missing parents
> 16/05/17 17:04:47 INFO MemoryStore: ensureFreeSpace(3248) called with
> curMem=428022, maxMem=244187136
> 16/05/17 17:04:47 INFO MemoryStore: Block broadcast_3 stored as values in
> memory (estimated size 3.2 KB, free 232.5 MB)
> 16/05/17 17:04:47 INFO MemoryStore: ensureFreeSpace(2022) called with
> curMem=431270, maxMem=244187136
> 16/05/17 17:04:47 INFO MemoryStore: Block broadcast_3_piece0 stored as
> bytes in memory (estimated size 2022.0 B, free 232.5 MB)
> 16/05/17 17:04:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in
> memory on 10.6.164.40:33563 (size: 2022.0 B, free: 232.8 MB)
> 16/05/17 17:04:47 INFO SparkContext: Created broadcast 3 from broadcast at
> DAGScheduler.scala:861
> 16/05/17 17:04:47 INFO DAGScheduler: Submitting 1 missing tasks from
> ResultStage 0 (MapPartitionsRDD[1] at map at HBaseContext.scala:580)
> 16/05/17 17:04:47 INFO YarnScheduler: Adding task set 0.0 with 1 tasks
> 16/05/17 17:04:47 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID
> 0, hpg-dev-vm, partition 0,PROCESS_LOCAL, 2208 bytes)
> 16/05/17 17:04:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in
> memory on hpg-dev-vm:52698 (size: 2022.0 B, free: 388.4 MB)
> 16/05/17 17:04:48 INFO BlockManagerInfo: Added broadcast_2_piece0 in
> memory on hpg-dev-vm:52698 (size: 26.0 KB, free: 388.4 MB)
> 16/05/17 17:04:57 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
> hpg-dev-vm): org.apache.hadoop.hbase.client.RetriesExhaustedException:
> Can't get the location
> at
> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:308)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:155)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:63)
> at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
> at
> org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:314)
> at
> org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:289)
> at
> org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:161)
> at
> org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:156)
> at
> 

Re: How to get the batch information from Streaming UI

2016-05-16 Thread John Trengrove
You would want to add a listener to your Spark Streaming context. Have a
look at the StatsReportListener [1].

[1]
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.scheduler.StatsReportListener
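
If the periodic output of StatsReportListener is not detailed enough, here is a small
sketch of a custom StreamingListener (my own example, assuming a StreamingContext
named ssc) that logs the same per-batch numbers the Streaming UI shows:

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    class BatchInfoLogger extends StreamingListener {
      // Called once per completed batch; BatchInfo carries the batch time,
      // record count and delays shown on the Streaming UI.
      override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
        val info = batchCompleted.batchInfo
        println(s"batch=${info.batchTime} records=${info.numRecords} " +
          s"schedulingDelayMs=${info.schedulingDelay.getOrElse(-1L)} " +
          s"processingTimeMs=${info.processingDelay.getOrElse(-1L)}")
      }
    }

    // Register it on the streaming context before ssc.start():
    // ssc.addStreamingListener(new BatchInfoLogger)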

2016-05-17 7:18 GMT+10:00 Samuel Zhou :

> Hi,
>
> Does anyone know how to get the batch information (like batch time, input
> size, processing time, status) from the Streaming UI by using the Scala/Java
> API? I ask because I want to put the information in log files, and the
> streaming jobs are managed by YARN.
>
> Thanks,
> Samuel
>


Re: Silly Question on my part...

2016-05-16 Thread John Trengrove
If you want to share RDDs, it might be a good idea to check out
Tachyon / Alluxio.

For the Thrift server, I believe the datasets live in your Spark
cluster as RDDs, and you just communicate with it via the Thrift
JDBC (distributed SQL query engine) connector.
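
As a rough sketch of that setup (my own example, assuming Spark 1.5.x built with Hive
support; the path and table name are hypothetical), you can start the Thrift server
against the same HiveContext that holds your cached data. JDBC clients then query
RDDs that live in the executors' block managers on the cluster, not on the Thrift
server node itself:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val sc = new SparkContext(new SparkConf().setAppName("shared-rdd-demo"))
    val hiveContext = new HiveContext(sc)

    // Cache a DataFrame and expose it as a temp table (hypothetical path).
    val df = hiveContext.read.json("hdfs:///tmp/events.json")
    df.cache()
    df.registerTempTable("events")

    // Serve the same context over JDBC; clients can now run
    // "SELECT ... FROM events" against the cached, cluster-resident data.
    HiveThriftServer2.startWithContext(hiveContext)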

2016-05-17 5:12 GMT+10:00 Michael Segel :

> For one use case... we were considering using the Thrift server as a way to
> allow multiple clients to access shared RDDs.
>
> Within the Thrift Context, we create an RDD and expose it as a hive table.
>
> The question is: where does the RDD exist? On the Thrift service node
> itself, or is that just a reference to an RDD that is contained within
> contexts on the cluster?
>
>
> Thx
>
> -Mike
>
>
>


Re: How to use the spark submit script / capability

2016-05-15 Thread John Trengrove
Assuming you are referring to running SparkSubmit.main programmatically;
otherwise, read this [1].

I can't find any scaladocs for org.apache.spark.deploy.*, but Oozie's
example [2] of using SparkSubmit is pretty comprehensive.

[1] http://spark.apache.org/docs/latest/submitting-applications.html
[2]
https://github.com/apache/oozie/blob/master/sharelib/spark/src/main/java/org/apache/oozie/action/hadoop/SparkMain.java
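
If that PR is the one that added the spark-launcher module, here is a minimal sketch
of using org.apache.spark.launcher.SparkLauncher (the paths, class name and Spark
home below are hypothetical placeholders):

    import org.apache.spark.launcher.SparkLauncher

    // Launch an application as a child process and wait for it to finish.
    val process = new SparkLauncher()
      .setSparkHome("/opt/spark")                 // hypothetical
      .setAppResource("/path/to/my-app.jar")      // hypothetical
      .setMainClass("com.example.MyApp")          // hypothetical
      .setMaster("yarn-cluster")
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .launch()

    process.waitFor()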

John

2016-05-16 2:33 GMT+10:00 Stephen Boesch :

>
> There is a committed PR from Marcelo Vanzin addressing that capability:
>
> https://github.com/apache/spark/pull/3916/files
>
> Is there any documentation on how to use this?  The PR itself has two
> comments asking for the docs that were not answered.
>


Re: VectorAssembler handling null values

2016-04-20 Thread John Trengrove
You could handle null values with the DataFrame.na functions in a
preprocessing step, e.g. DataFrame.na.fill().

For reference:

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions
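
A small sketch of that preprocessing step (my own example, reusing the column names
from your sample and assuming a DataFrame named df): fill the numeric columns with a
default and give the optional string columns a placeholder category before handing
the numeric columns to VectorAssembler.

    import org.apache.spark.ml.feature.VectorAssembler

    // Replace nulls before assembling: numeric defaults for f2/f3,
    // a placeholder category for the optional string columns.
    val filled = df
      .na.fill(0.0, Seq("f2", "f3"))
      .na.fill("missing", Seq("optF1", "sparseF1"))

    val assembler = new VectorAssembler()
      .setInputCols(Array("f2", "f3"))
      .setOutputCol("features")

    val assembled = assembler.transform(filled)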

John

On 21 April 2016 at 03:41, Andres Perez  wrote:

> so the missing data could be on a one-off basis, or from fields that are
> in general optional, or from, say, a count that is only relevant for
> certain cases (very sparse):
>
> f1|f2|f3|optF1|optF2|sparseF1
> a|15|3.5|cat1|142L|
> b|13|2.4|cat2|64L|catA
> c|2|1.6|||
> d|27|5.1||0|
>
> -Andy
>
> On Wed, Apr 20, 2016 at 1:38 AM, Nick Pentreath 
> wrote:
>
>> Could you provide an example of what your input data looks like?
>> Supporting missing values in a sparse result vector makes sense.
>>
>> On Tue, 19 Apr 2016 at 23:55, Andres Perez  wrote:
>>
>>> Hi everyone. org.apache.spark.ml.feature.VectorAssembler currently
>>> cannot handle null values. This presents a problem for us as we wish to run
>>> a decision tree classifier on sometimes sparse data. Is there a particular
>>> reason VectorAssembler is implemented in this way, and can anyone recommend
>>> the best path for enabling VectorAssembler to build vectors for data that
>>> will contain empty values?
>>>
>>> Thanks!
>>>
>>> -Andres
>>>
>>>
>