[jira] [Commented] (YARN-6214) NullPointer Exception while querying timeline server API

2020-03-10 Thread Benjamin Kim (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056315#comment-17056315
 ] 

Benjamin Kim commented on YARN-6214:


The root cause is that if one of the apps is in INIT state, some of its
properties, like the application type, are set to null. So if you make the API
call with the `state=FINISHED` HTTP parameter, you won't face this issue.

However, we probably need better error handling logic.

 

> NullPointer Exception while querying timeline server API
> 
>
> Key: YARN-6214
> URL: https://issues.apache.org/jira/browse/YARN-6214
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.7.1
>Reporter: Ravi Teja Chilukuri
>Priority: Major
>
> The apps API works fine and gives all applications, including MapReduce and Tez:
> http://:8188/ws/v1/applicationhistory/apps
> But when queried with application types using these APIs, it fails with a
> NullPointerException.
> http://:8188/ws/v1/applicationhistory/apps?applicationTypes=TEZ
> http://:8188/ws/v1/applicationhistory/apps?applicationTypes=MAPREDUCE
> NullPointerException: java.lang.NullPointerException
> Blocked on this issue as we are not able to run analytics on the tez job 
> counters on the prod jobs. 
> Timeline Logs:
> |2017-02-22 11:47:57,183 WARN  webapp.GenericExceptionHandler 
> (GenericExceptionHandler.java:toResponse(98)) - INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.webapp.WebServices.getApps(WebServices.java:195)
>   at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSWebServices.getApps(AHSWebServices.java:96)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
>   at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
> Complete stacktrace:
> http://pastebin.com/bRgxVabf






[jira] [Commented] (YARN-6214) NullPointer Exception while querying timeline server API

2020-02-27 Thread Benjamin Kim (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17047125#comment-17047125
 ] 

Benjamin Kim commented on YARN-6214:


It happened to me too:

 
{code:java}
{"exception": "NullPointerException","javaClassName": 
"java.lang.NullPointerException"}{code}
Using 2.8.4, as Jason noted, it happens while checking app types.
{code:java}
2020-02-28 09:52:20,041 WARN 
org.apache.hadoop.yarn.webapp.GenericExceptionHandler 
(2070044461@qtp-1305004711-22): INTERNAL_SERVER_ERROR2020-02-28 09:52:20,041 
WARN org.apache.hadoop.yarn.webapp.GenericExceptionHandler 
(2070044461@qtp-1305004711-22): 
INTERNAL_SERVER_ERRORjava.lang.NullPointerException at 
org.apache.hadoop.yarn.server.webapp.WebServices.getApps(WebServices.java:199) 
at 
org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSWebServices.getApps(AHSWebServices.java:96)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
 at 
com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
 at 
com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
 at 
com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
 at 
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
 at 
com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
 at 
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
 at 
com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
 at 
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
 at 
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
 at 
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
 at 
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
 at 
com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
 at 
com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
 at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886)
 at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
 at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
 at 
com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) 
at 
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
 at 
com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
 at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
 at 
org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
 at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
 at 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)
 at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:294)
 at 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
 at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
 at 
org.apache.hadoop.security.http.CrossOriginFilter.doFilter(CrossOriginFilter.java:95)
 at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
 at 
org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1353)
 at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
 at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
 at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
 at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at 
org.mortbay.jetty.security.Securit

[spyder] Spyder 3.3.1 in Anaconda Navigator 1.8.7 Autocomplete and Online Help are not working

2018-08-15 Thread Benjamin Kim
I just started a Data Science class where they use Spyder as the IDE. After 
installing the latest Anaconda on my Macbook Pro with High Sierra and 
updating Spyder to 3.3.1, I got Spyder to launch fine. But, when I try to 
get information about objects and methods (cmd-i), nothing comes up. Also, 
autocomplete doesn't work either. When I hit the period, I expect a list of 
available options to show up, but nothing shows up. Can somebody help me 
with getting this to work? How can I find out if either all the 
dependencies are installed or if something isn't configured correctly?

Thank you,
Ben



Re: Append In-Place to S3

2018-06-07 Thread Benjamin Kim
I tried a different tactic. I still append based on the query below, but I add 
another deduping step afterwards, writing to a staging directory then 
overwriting back. Luckily, the data is small enough for this to happen fast.
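A minimal sketch of that dedupe-and-overwrite step (the paths and key columns
here are illustrative, not the actual job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedupe_in_place_sketch").getOrCreate()

data_path = "s3a://my-bucket/my-table/"             # illustrative
staging_path = "s3a://my-bucket/my-table_staging/"  # illustrative
keys = ["source", "source_id", "target", "target_id"]

# After the usual append, re-read the full data set and drop duplicate keys.
deduped = spark.read.parquet(data_path).dropDuplicates(keys)

# Write the deduped copy to a staging directory first, so the job never reads
# and overwrites the same path at once...
deduped.write.parquet(staging_path, mode="overwrite", compression="gzip")

# ...then overwrite the original location from the staged copy.
spark.read.parquet(staging_path).write.parquet(
    data_path, mode="overwrite", compression="gzip")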

Cheers,
Ben

> On Jun 3, 2018, at 3:02 PM, Tayler Lawrence Jones  
> wrote:
> 
> Sorry actually my last message is not true for anti join, I was thinking of 
> semi join. 
> 
> -TJ
> 
> On Sun, Jun 3, 2018 at 14:57 Tayler Lawrence Jones  <mailto:t.jonesd...@gmail.com>> wrote:
> A left join with null filter is only the same as a left anti join if the join 
> keys can be guaranteed unique in the existing data. Since hive tables on s3 
> offer no unique guarantees outside of your processing code, I recommend using 
> left anti join over left join + null filter.
> 
> -TJ
> 
> On Sun, Jun 3, 2018 at 14:47 ayan guha  <mailto:guha.a...@gmail.com>> wrote:
> I do not use anti join semantics, but you can use left outer join and then 
> filter out nulls from right side. Your data may have dups on the columns 
> separately but it should not have dups on the composite key ie all columns 
> put together.
> 
> On Mon, 4 Jun 2018 at 6:42 am, Tayler Lawrence Jones  <mailto:t.jonesd...@gmail.com>> wrote:
> The issue is not the append vs overwrite - perhaps those responders do not 
> know Anti join semantics. Further, Overwrite on s3 is a bad pattern due to s3 
> eventual consistency issues. 
> 
> First, your sql query is wrong as you don’t close the parenthesis of the CTE 
> (“with” part). In fact, it looks like you don’t need that with at all, and 
> the query should fail to parse. If that does parse, I would open a bug on the 
> spark jira.
> 
> Can you provide the query that you are using to detect duplication so I can 
> see if your deduplication logic matches the detection query? 
> 
> -TJ
> 
> On Sat, Jun 2, 2018 at 10:22 Aakash Basu  <mailto:aakash.spark@gmail.com>> wrote:
> As Jay suggested correctly, if you're joining then overwrite otherwise only 
> append as it removes dups.
> 
> I think, in this scenario, just change it to write.mode('overwrite') because 
> you're already reading the old data and your job would be done.
> 
> 
> On Sat 2 Jun, 2018, 10:27 PM Benjamin Kim,  <mailto:bbuil...@gmail.com>> wrote:
> Hi Jay,
> 
> Thanks for your response. Are you saying to append the new data and then 
> remove the duplicates to the whole data set afterwards overwriting the 
> existing data set with new data set with appended values? I will give that a 
> try. 
> 
> Cheers,
> Ben
> 
> On Fri, Jun 1, 2018 at 11:49 PM Jay  <mailto:jayadeep.jayara...@gmail.com>> wrote:
> Benjamin,
> 
> The append will append the "new" data to the existing data with removing the 
> duplicates. You would need to overwrite the file everytime if you need unique 
> values.
> 
> Thanks,
> Jayadeep
> 
> On Fri, Jun 1, 2018 at 9:31 PM Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> I have a situation where I trying to add only new rows to an existing data 
> set that lives in S3 as gzipped parquet files, looping and appending for each 
> hour of the day. First, I create a DF from the existing data, then I use a 
> query to create another DF with the data that is new. Here is the code 
> snippet.
> 
> df = spark.read.parquet(existing_data_path)
> df.createOrReplaceTempView(‘existing_data’)
> new_df = spark.read.parquet(new_data_path)
> new_df.createOrReplaceTempView(’new_data’)
> append_df = spark.sql(
> """
> WITH ids AS (
> SELECT DISTINCT
> source,
> source_id,
> target,
> target_id
> FROM new_data i
> LEFT ANTI JOIN existing_data im
> ON i.source = im.source
> AND i.source_id = im.source_id
> AND i.target = im.target
> AND i.target = im.target_id
> """
> )
> append_df.coalesce(1).write.parquet(existing_data_path, mode='append', 
> compression='gzip’)
> 
> I thought this would append new rows and keep the data unique, but I am see 
> many duplicates. Can someone help me with this and tell me what I am doing 
> wrong?
> 
> Thanks,
> Ben
> -- 
> Best Regards,
> Ayan Guha



Re: Zeppelin 0.8

2018-06-07 Thread Benjamin Kim
Can anyone tell me what the status is for 0.8 release?

> On May 2, 2018, at 4:43 PM, Jeff Zhang  wrote:
> 
> 
> Yes, 0.8 will support spark 2.3
> 
> Benjamin Kim <bbuil...@gmail.com> wrote on Thursday, May 3, 2018 at 1:59 AM:
> Will Zeppelin 0.8 have Spark 2.3 support?
> 
>> On Apr 30, 2018, at 1:27 AM, Rotem Herzberg > <mailto:rotem.herzb...@gigaspaces.com>> wrote:
>> 
>> Thanks
>> 
>> On Mon, Apr 30, 2018 at 11:16 AM, Jeff Zhang > <mailto:zjf...@gmail.com>> wrote:
>> 
>> I am preparing the RC for 0.8
>> 
>> 
>> Rotem Herzberg <rotem.herzb...@gigaspaces.com> wrote on Monday, April 30, 2018 at 3:57 PM:
>> Hi,
>> 
>> What is the release date for Zeppelin 0.8? (support for spark 2.3)
>> 
>> Thanks,
>> 
>> -- 
>>  <http://www.gigaspaces.com/?utm_source=Signature_medium=Email>  
>> Rotem Herzberg
>> SW Engineer | GigaSpaces Technologies
>>  
>> rotem.herzb...@gigaspaces.com <mailto:rotem.herzb...@gigaspaces.com>   | M 
>> +972547718880 <> 
>> 
>>   <https://twitter.com/gigaspaces>
>> <https://www.linkedin.com/company/gigaspaces>
>> <https://www.facebook.com/gigaspaces>
>> 
>> 
>> -- 
>>  <http://www.gigaspaces.com/?utm_source=Signature_medium=Email>  
>> Rotem Herzberg
>> SW Engineer | GigaSpaces Technologies
>>  
>> rotem.herzb...@gigaspaces.com <mailto:rotem.herzb...@gigaspaces.com>   | M 
>> +972547718880 <> 
>> 
>>   <https://twitter.com/gigaspaces>
>> <https://www.linkedin.com/company/gigaspaces>
>> <https://www.facebook.com/gigaspaces>



Re: Credentials for JDBC

2018-06-07 Thread Benjamin Kim
Hi 종열,

Can you show me how?

Thanks,
Ben


> On Jun 6, 2018, at 10:32 PM, Jongyoul Lee  wrote:
> 
> We have a trick to get credential information from a credential page. I'll
> look into it.
> 
> On Thu, Jun 7, 2018 at 7:53 AM, Benjamin Kim  <mailto:bbuil...@gmail.com>> wrote:
> I created a JDBC interpreter for AWS Athena, and it passes the access key as 
> UID and secret key as PWD in the URL connection string. Does anyone know if I 
> can setup each user to pass their own credentials in a, sort of, credentials 
> file or config?
> 
> Thanks,
> Ben
> 
> 
> 
> -- 
> 이종열, Jongyoul Lee, 李宗烈
> http://madeng.net <http://madeng.net/>



Credentials for JDBC

2018-06-06 Thread Benjamin Kim
I created a JDBC interpreter for AWS Athena, and it passes the access key as 
UID and secret key as PWD in the URL connection string. Does anyone know if I 
can setup each user to pass their own credentials in a, sort of, credentials 
file or config?

Thanks,
Ben

Re: Append In-Place to S3

2018-06-02 Thread Benjamin Kim
Hi Jay,

Thanks for your response. Are you saying to append the new data and then
remove the duplicates to the whole data set afterwards overwriting the
existing data set with new data set with appended values? I will give that
a try.

Cheers,
Ben

On Fri, Jun 1, 2018 at 11:49 PM Jay  wrote:

> Benjamin,
>
> The append will append the "new" data to the existing data with removing
> the duplicates. You would need to overwrite the file everytime if you need
> unique values.
>
> Thanks,
> Jayadeep
>
> On Fri, Jun 1, 2018 at 9:31 PM Benjamin Kim  wrote:
>
>> I have a situation where I trying to add only new rows to an existing
>> data set that lives in S3 as gzipped parquet files, looping and appending
>> for each hour of the day. First, I create a DF from the existing data, then
>> I use a query to create another DF with the data that is new. Here is the
>> code snippet.
>>
>> df = spark.read.parquet(existing_data_path)
>> df.createOrReplaceTempView(‘existing_data’)
>> new_df = spark.read.parquet(new_data_path)
>> new_df.createOrReplaceTempView(’new_data’)
>> append_df = spark.sql(
>> """
>> WITH ids AS (
>> SELECT DISTINCT
>> source,
>> source_id,
>> target,
>> target_id
>> FROM new_data i
>> LEFT ANTI JOIN existing_data im
>> ON i.source = im.source
>> AND i.source_id = im.source_id
>> AND i.target = im.target
>> AND i.target = im.target_id
>> """
>> )
>> append_df.coalesce(1).write.parquet(existing_data_path, mode='append',
>> compression='gzip’)
>>
>>
>> I thought this would append new rows and keep the data unique, but I am
>> see many duplicates. Can someone help me with this and tell me what I am
>> doing wrong?
>>
>> Thanks,
>> Ben
>>
>


Append In-Place to S3

2018-06-01 Thread Benjamin Kim
I have a situation where I am trying to add only new rows to an existing data set 
that lives in S3 as gzipped parquet files, looping and appending for each hour 
of the day. First, I create a DF from the existing data, then I use a query to 
create another DF with the data that is new. Here is the code snippet.

df = spark.read.parquet(existing_data_path)
df.createOrReplaceTempView(‘existing_data’)
new_df = spark.read.parquet(new_data_path)
new_df.createOrReplaceTempView(’new_data’)
append_df = spark.sql(
"""
WITH ids AS (
SELECT DISTINCT
source,
source_id,
target,
target_id
FROM new_data i
LEFT ANTI JOIN existing_data im
ON i.source = im.source
AND i.source_id = im.source_id
AND i.target = im.target
AND i.target = im.target_id
"""
)
append_df.coalesce(1).write.parquet(existing_data_path, mode='append', 
compression='gzip’)

I thought this would append new rows and keep the data unique, but I am seeing 
many duplicates. Can someone help me with this and tell me what I am doing 
wrong?

Thanks,
Ben
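For reference, here is a corrected form of the query in the message above. The
WITH clause's parenthesis is never closed and nothing ever selects from the CTE,
so it can simply be dropped, and the last join condition should compare
i.target_id to im.target_id rather than i.target. Assuming the intent is a plain
left anti join on all four key columns, it would look roughly like this:

append_df = spark.sql(
    """
    SELECT DISTINCT
        i.source,
        i.source_id,
        i.target,
        i.target_id
    FROM new_data i
    LEFT ANTI JOIN existing_data im
        ON  i.source    = im.source
        AND i.source_id = im.source_id
        AND i.target    = im.target
        AND i.target_id = im.target_id
    """
)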

Re: Zeppelin 0.8

2018-05-02 Thread Benjamin Kim
Will Zeppelin 0.8 have Spark 2.3 support?

> On Apr 30, 2018, at 1:27 AM, Rotem Herzberg  
> wrote:
> 
> Thanks
> 
> On Mon, Apr 30, 2018 at 11:16 AM, Jeff Zhang  > wrote:
> 
> I am preparing the RC for 0.8
> 
> 
> Rotem Herzberg wrote on Monday, April 30, 2018 at 3:57 PM:
> Hi,
> 
> What is the release date for Zeppelin 0.8? (support for spark 2.3)
> 
> Thanks,
> 
> -- 
>     
> Rotem Herzberg
> SW Engineer | GigaSpaces Technologies
>  
> rotem.herzb...@gigaspaces.com    | M 
> +972547718880 <> 
> 
>   
> 
> 
> 
> 
> -- 
>     
> Rotem Herzberg
> SW Engineer | GigaSpaces Technologies
>  
> rotem.herzb...@gigaspaces.com    | M 
> +972547718880 <> 
> 
>   
> 
> 


Re: Spark 2.2 Structured Streaming + Kinesis

2017-11-13 Thread Benjamin Kim
To add, we have a CDH 5.12 cluster with Spark 2.2 in our data center.

On Mon, Nov 13, 2017 at 3:15 PM Benjamin Kim <bbuil...@gmail.com> wrote:

> Does anyone know if there is a connector for AWS Kinesis that can be used
> as a source for Structured Streaming?
>
> Thanks.
>
>


Databricks Serverless

2017-11-13 Thread Benjamin Kim
I have a question about this. The documentation compares the concept
similar to BigQuery. Does this mean that we will no longer need to deal
with instances and just pay for execution duration and amount of data
processed? I’m just curious about how this will be priced.

Also, when will it be ready for production?

Cheers.


Spark 2.2 Structured Streaming + Kinesis

2017-11-13 Thread Benjamin Kim
Does anyone know if there is a connector for AWS Kinesis that can be used
as a source for Structured Streaming?

Thanks.


Serverless ETL

2017-10-17 Thread Benjamin Kim
With AWS having Glue and GCE having Dataprep, is Databricks coming out with
an equivalent or better? I know that Serverless is a new offering, but will
it go farther with automatic data schema discovery, profiling, metadata
storage, change triggering, joining, transform suggestions, etc.?

Just curious.

Cheers,
Ben


DMP/CDP Profile Store

2017-08-30 Thread Benjamin Kim
I was wondering whether anyone has worked on a DMP/CDP for storing user and
customer profiles in Kudu. Each user will have their base IDs (aka identity
graph) along with statistics based on their attributes, plus tables for
these attributes grouped by category.

Please let me know what you think of my thoughts.

I was thinking of creating a base profile table to store the ID's and
statistics along with unchanging or rarely changing attributes, such as
name, that do not need to be tracked. Next, I would create tables to
categorize groups of attributes, such as user information, behaviors,
geolocation, devices, etc. These attribute tables would have columns for
each attribute and would track changes by only inserting data via a time
stamp column to record when each value was entered. Essentially, I would follow
the type 2 slowly changing dimension approach used in data warehouses. For
attributes that expire, we will partition by a time range so that we can
drop off expired data. For attributes where we only need the latest one, we
would add an active column so they can easily be flagged and queried after
older versions are inactivated.
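As a rough illustration of the insert-only-with-a-timestamp idea, picking each
user's latest attribute version would look something like this (the table path
and column names are hypothetical, and the read could just as well come from
the Kudu Spark connector):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("latest_attrs_sketch").getOrCreate()

# Hypothetical attribute table: one row per (user_id, change), appended with an
# entered_at timestamp and never updated in place (type 2 style).
attrs = spark.read.parquet("s3a://my-bucket/profile_geolocation/")

# The latest version per user is simply the newest entered_at row per user_id.
w = Window.partitionBy("user_id").orderBy(F.col("entered_at").desc())
latest = (attrs
          .withColumn("_rn", F.row_number().over(w))
          .where(F.col("_rn") == 1)
          .drop("_rn"))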

Any comments or advice would be truly appreciated.

Cheers,
Ben


Re: Configure Impala for Kudu on Separate Cluster

2017-08-18 Thread Benjamin Kim
Todd,

I'll keep this in mind. This information will be useful. I'll try again.

Thanks,
Ben


On Wed, Aug 16, 2017 at 4:32 PM Todd Lipcon <t...@cloudera.com> wrote:

> On Wed, Aug 16, 2017 at 6:16 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> Hi,
>>
>> I found 2 issues. First, network connection is blocked. I filed a request
>> to open it. Second, Kudu 1.4 had problems registering any tablet server.
>> So, I reverted back to 1.3.1. The error had to do security and how IP
>> routing was not trusted.
>>
>
> Can you clarify what problem you are hitting? Is this the security
> improvement that prevents from running with public IP addresses without
> security enabled? You can override this if you're sure that your network is
> secured by using the --trusted-subnets option (see the security docs for
> details)
>
> -Todd
>
>
>>
>> Also, I installed Kudu onto another cluster where I knew the network
>> would not be a problem, and it worked.
>>
>> Cheers,
>> Ben
>>
>> On Wed, Aug 16, 2017 at 12:44 AM Alexey Serbin <aser...@cloudera.com>
>> wrote:
>>
>>> Ben,
>>>
>>> As Todd mentioned, it might be some network connectivity problem. I
>>> would suspect some issues with connectivity between the node where the
>>> Impala shell is running and the Kudu master node.
>>>
>>> To start troubleshooting, I would verify that the node where you run the
>>> Impala shell (that's 172.35.120.191, right?) can establish a TCP
>>> connection to the master RPC end-point.  E.g., try to run from the
>>> command shell at 172.35.120.191:
>>>
>>>telnet 172.35.121.101 7051
>>>
>>> Would it succeed?
>>>
>>> Also, if running multi-master Kudu cluster, it might happen that masters
>>> cannot communicate with each other.  To troubleshoot that, I would try
>>> to establish a TCP connection to the RPC end-point of the master at one
>>> node from another master node.  E.g., if using telnet, from
>>> , in the command-line shell:
>>>
>>>telnet  7051
>>>(just substitute  and  with appropriate
>>> hostnames/IP addresses).
>>>
>>>
>>>
>>> Best regards,
>>>
>>> Alexey
>>>
>>>
>>> On 8/15/17 9:53 PM, Benjamin Kim wrote:
>>> > Todd,
>>> >
>>> > Caused by: org.apache.kudu.client.NoLeaderMasterFoundException: Master
>>> > config (prod-dc1-datanode151.pdc1i.gradientx.com:7051
>>> > <http://prod-dc1-datanode151.pdc1i.gradientx.com:7051/>) has no
>>> > leader. Exceptions received:
>>> > org.apache.kudu.client.RecoverableException: [Peer Kudu Master -
>>> > prod-dc1-datanode151.pdc1i.gradientx.com:7051
>>> > <http://prod-dc1-datanode151.pdc1i.gradientx.com:7051/>] Connection
>>> > reset on [id: 0x6232f33f, /172.35.120.191:47848
>>> > <http://172.35.120.191:47848/> :> /172.35.121.101:7051
>>> > <http://172.35.121.101:7051/>]
>>> >
>>> > We got this error trying to use Kudu from within the cluster. Do you
>>> > know what this means?
>>> >
>>> > Cheers,
>>> > Ben
>>> >
>>> >
>>> > On Tue, Aug 15, 2017 at 12:40 AM Todd Lipcon <t...@cloudera.com
>>> > <mailto:t...@cloudera.com>> wrote:
>>> >
>>> > Is there a possibility that the remote node (prod-dc1-datanode151)
>>> > is firewalled off from whatever host you are submitting the query
>>> > to? The error message is admittedly pretty bad, but it basically
>>> > means it's getting "connection refused", indicating that either
>>> > there is no master running on that host or it has been blocked (eg
>>> > an iptables REJECT rule)
>>> >
>>> > -Todd
>>> >
>>> > On Mon, Aug 14, 2017 at 10:36 PM, Benjamin Kim <bbuil...@gmail.com
>>> > <mailto:bbuil...@gmail.com>> wrote:
>>> >
>>> > Hi Todd,
>>> >
>>> > I tried to create a Kudu table using impala shell, and I got
>>> > this error.
>>> >
>>> > create table my_first_table
>>> > (
>>> >   id bigint,
>>> >   name string,
>>> >   primary key(id)
>>> > )
>>> > partition by hash partitions 16

Re: Configure Impala for Kudu on Separate Cluster

2017-08-15 Thread Benjamin Kim
Todd,

Caused by: org.apache.kudu.client.NoLeaderMasterFoundException: Master
config (prod-dc1-datanode151.pdc1i.gradientx.com:7051) has no leader.
Exceptions received: org.apache.kudu.client.RecoverableException: [Peer
Kudu Master - prod-dc1-datanode151.pdc1i.gradientx.com:7051] Connection
reset on [id: 0x6232f33f, /172.35.120.191:47848 :> /172.35.121.101:7051]

We got this error trying to use Kudu from within the cluster. Do you know
what this means?

Cheers,
Ben


On Tue, Aug 15, 2017 at 12:40 AM Todd Lipcon <t...@cloudera.com> wrote:

> Is there a possibility that the remote node (prod-dc1-datanode151) is
> firewalled off from whatever host you are submitting the query to? The
> error message is admittedly pretty bad, but it basically means it's getting
> "connection refused", indicating that either there is no master running on
> that host or it has been blocked (eg an iptables REJECT rule)
>
> -Todd
>
> On Mon, Aug 14, 2017 at 10:36 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> Hi Todd,
>>
>> I tried to create a Kudu table using impala shell, and I got this error.
>>
>> create table my_first_table
>> (
>>   id bigint,
>>   name string,
>>   primary key(id)
>> )
>> partition by hash partitions 16
>> stored as kudu;
>> Query: create table my_first_table
>> (
>>   id bigint,
>>   name string,
>>   primary key(id)
>> )
>> partition by hash partitions 16
>> stored as kudu
>> ERROR: ImpalaRuntimeException: Error creating Kudu table
>> 'impala::default.my_first_table'
>> CAUSED BY: NonRecoverableException: Too many attempts:
>> KuduRpc(method=ListTables, tablet=null, attempt=101,
>> DeadlineTracker(timeout=18, elapsed=178226), Traces: [0ms] querying
>> master, [1ms] Sub rpc: ConnectToMaster sending RPC to server
>> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051, [2ms] Sub rpc:
>> ConnectToMaster received from server
>> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051 response Network
>> error: [Peer master-prod-dc1-datanode151.pdc1i.gradientx.com:7051]
>> Connection closed, [5ms] delaying RPC due to Service unavailable: Master
>> config (prod-dc1-datanode151.pdc1i.gradientx.com:7051) has no leader.
>> Exceptions received: org.apache.kudu.client.RecoverableException: [Peer
>> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051] Connection closed,
>> [21ms] querying master, [22ms] Sub rpc: ConnectToMaster sending RPC to
>> server master-prod-dc1-datanode151.pdc1i.gradientx.com:7051, [24ms] Sub
>> rpc: ConnectToMaster received from server
>> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051 response Network
>> error: [Peer master-prod-dc1-datanode151.pdc1i.gradientx.com:7051]
>> Connection closed, [26ms] delaying RPC due to Service unavailable: Master
>> config (prod-dc1-datanode151.pdc1i.gradientx.com:7051) has no leader.
>> Exceptions received: org.apache.kudu.client.RecoverableException: [Peer
>> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051] Connection closed,
>> [41ms] querying master, [41ms] Sub rpc: ConnectToMaster sending RPC to
>> server master-prod-dc1-datanode151.pdc1i.gradientx.com:7051, [43ms] Sub
>> rpc: ConnectToMaster received from server
>> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051 response Network
>> error: [Peer master-prod-dc1-datanode151.pdc1i.gradientx.com:7051]
>> Connection closed, [46ms] delaying RPC due to Service unavailable: Master
>> config (prod-dc1-datanode151.pdc1i.gradientx.com:7051) has no leader.
>> Exceptions received: org.apache.kudu.client.RecoverableException: [Peer
>> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051] Connection closed,
>> [62ms] querying master, [62ms] Sub rpc: ConnectToMaster sending RPC to
>> server master-prod-dc1-datanode151.pdc1i.gradientx.com:7051, [64ms] Sub
>> rpc: ConnectToMaster received from server
>> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051 response Network
>> error: [Peer master-prod-dc1-datanode151.pdc1i.gradientx.com:7051]
>> Connection closed, [66ms] delaying RPC due to Service unavailable: Master
>> config (prod-dc1-datanode151.pdc1i.gradientx.com:7051) has no leader.
>> Exceptions received: org.apache.kudu.client.RecoverableException: [Peer
>> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051] Connection closed,
>> [81ms] querying master, [81ms] Sub rpc: ConnectToMaster sending RPC to
>> server master-prod-dc1-datanode151.pdc1i.gradientx.com:7051, [84ms] Sub
>> rpc: ConnectToMaster received from server
>> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051 response Network
>> error: [Peer master-prod-dc1-datanode151.pdc1i.gradient

Re: Configure Impala for Kudu on Separate Cluster

2017-08-15 Thread Benjamin Kim
Todd,

I'll check this with the systems team.

Cheers,
Ben


On Mon, Aug 14, 2017 at 10:40 PM Todd Lipcon <t...@cloudera.com> wrote:

> Is there a possibility that the remote node (prod-dc1-datanode151) is
> firewalled off from whatever host you are submitting the query to? The
> error message is admittedly pretty bad, but it basically means it's getting
> "connection refused", indicating that either there is no master running on
> that host or it has been blocked (eg an iptables REJECT rule)
>
> -Todd
>
> On Mon, Aug 14, 2017 at 10:36 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> Hi Todd,
>>
>> I tried to create a Kudu table using impala shell, and I got this error.
>>
>> create table my_first_table
>> (
>>   id bigint,
>>   name string,
>>   primary key(id)
>> )
>> partition by hash partitions 16
>> stored as kudu;
>> Query: create table my_first_table
>> (
>>   id bigint,
>>   name string,
>>   primary key(id)
>> )
>> partition by hash partitions 16
>> stored as kudu
>> ERROR: ImpalaRuntimeException: Error creating Kudu table
>> 'impala::default.my_first_table'
>> CAUSED BY: NonRecoverableException: Too many attempts:
>> KuduRpc(method=ListTables, tablet=null, attempt=101,
>> DeadlineTracker(timeout=18, elapsed=178226), Traces: [0ms] querying
>> master, [1ms] Sub rpc: ConnectToMaster sending RPC to server
>> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051, [2ms] Sub rpc:
>> ConnectToMaster received from server
>> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051 response Network
>> error: [Peer master-prod-dc1-datanode151.pdc1i.gradientx.com:7051]
>> Connection closed, [5ms] delaying RPC due to Service unavailable: Master
>> config (prod-dc1-datanode151.pdc1i.gradientx.com:7051) has no leader.
>> Exceptions received: org.apache.kudu.client.RecoverableException: [Peer
>> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051] Connection closed,
>> [21ms] querying master, [22ms] Sub rpc: ConnectToMaster sending RPC to
>> server master-prod-dc1-datanode151.pdc1i.gradientx.com:7051, [24ms] Sub
>> rpc: ConnectToMaster received from server
>> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051 response Network
>> error: [Peer master-prod-dc1-datanode151.pdc1i.gradientx.com:7051]
>> Connection closed, [26ms] delaying RPC due to Service unavailable: Master
>> config (prod-dc1-datanode151.pdc1i.gradientx.com:7051) has no leader.
>> Exceptions received: org.apache.kudu.client.RecoverableException: [Peer
>> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051] Connection closed,
>> [41ms] querying master, [41ms] Sub rpc: ConnectToMaster sending RPC to
>> server master-prod-dc1-datanode151.pdc1i.gradientx.com:7051, [43ms] Sub
>> rpc: ConnectToMaster received from server
>> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051 response Network
>> error: [Peer master-prod-dc1-datanode151.pdc1i.gradientx.com:7051]
>> Connection closed, [46ms] delaying RPC due to Service unavailable: Master
>> config (prod-dc1-datanode151.pdc1i.gradientx.com:7051) has no leader.
>> Exceptions received: org.apache.kudu.client.RecoverableException: [Peer
>> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051] Connection closed,
>> [62ms] querying master, [62ms] Sub rpc: ConnectToMaster sending RPC to
>> server master-prod-dc1-datanode151.pdc1i.gradientx.com:7051, [64ms] Sub
>> rpc: ConnectToMaster received from server
>> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051 response Network
>> error: [Peer master-prod-dc1-datanode151.pdc1i.gradientx.com:7051]
>> Connection closed, [66ms] delaying RPC due to Service unavailable: Master
>> config (prod-dc1-datanode151.pdc1i.gradientx.com:7051) has no leader.
>> Exceptions received: org.apache.kudu.client.RecoverableException: [Peer
>> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051] Connection closed,
>> [81ms] querying master, [81ms] Sub rpc: ConnectToMaster sending RPC to
>> server master-prod-dc1-datanode151.pdc1i.gradientx.com:7051, [84ms] Sub
>> rpc: ConnectToMaster received from server
>> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051 response Network
>> error: [Peer master-prod-dc1-datanode151.pdc1i.gradientx.com:7051]
>> Connection closed, [86ms] delaying RPC due to Service unavailable: Master
>> config (prod-dc1-datanode151.pdc1i.gradientx.com:7051) has no leader.
>> Exceptions received: org.apache.kudu.client.RecoverableException: [Peer
>> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051] Connection closed,
>> [122ms] querying master, [122ms]

Re: Cloudera Spark 2.2

2017-08-04 Thread Benjamin Kim
Hi Ruslan,

Can you send me the steps you used to build it, especially the Maven
command with the arguments? I will try to build it also.

I do believe that the binaries are for official releases.

Cheers,
Ben


On Wed, Aug 2, 2017 at 3:44 PM Ruslan Dautkhanov <dautkha...@gmail.com>
wrote:

> It was built. I think binaries are only available for official releases?
>
>
>
> --
> Ruslan Dautkhanov
>
> On Wed, Aug 2, 2017 at 4:41 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> Did you build Zeppelin or download the binary?
>>
>> On Wed, Aug 2, 2017 at 3:40 PM Ruslan Dautkhanov <dautkha...@gmail.com>
>> wrote:
>>
>>> We're using an ~April snapshot of Zeppelin, so not sure about 0.7.1.
>>>
>>> Yes, we have that spark home in zeppelin-env.sh
>>>
>>>
>>>
>>> --
>>> Ruslan Dautkhanov
>>>
>>> On Wed, Aug 2, 2017 at 4:31 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>
>>>> Does this work with Zeppelin 0.7.1? We get an error when setting SPARK_HOME
>>>> in zeppelin-env.sh to what you have below.
>>>>
>>>> On Wed, Aug 2, 2017 at 3:24 PM Ruslan Dautkhanov <dautkha...@gmail.com>
>>>> wrote:
>>>>
>>>>> You don't have to use spark2-shell and spark2-submit to use Spark 2.
>>>>> That can be controled by setting SPARK_HOME using regular
>>>>> spark-submit/spark-shell.
>>>>>
>>>>> $ which spark-submit
>>>>> /usr/bin/spark-submit
>>>>> $ which spark-shell
>>>>> /usr/bin/spark-shell
>>>>>
>>>>> $ spark-shell
>>>>> Welcome to
>>>>>     __
>>>>>  / __/__  ___ _/ /__
>>>>> _\ \/ _ \/ _ `/ __/  '_/
>>>>>/___/ .__/\_,_/_/ /_/\_\   version 1.6.0
>>>>>       /_/
>>>>>
>>>>>
>>>>>
>>>>> $ export SPARK_HOME=/opt/cloudera/parcels/SPARK2/lib/spark2
>>>>>
>>>>> $ spark-shell
>>>>> Welcome to
>>>>>     __
>>>>>  / __/__  ___ _/ /__
>>>>> _\ \/ _ \/ _ `/ __/  '_/
>>>>>/___/ .__/\_,_/_/ /_/\_\   version 2.1.0.cloudera1
>>>>>   /_/
>>>>>
>>>>>
>>>>> spark-submit and spark-shell are just shell script wrappers.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ruslan Dautkhanov
>>>>>
>>>>> On Wed, Aug 2, 2017 at 10:22 AM, Benjamin Kim <bbuil...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> According to the Zeppelin documentation, Zeppelin 0.7.1 supports
>>>>>> Spark 2.1. But, I don't know if it supports Spark 2.2 or even 2.1 from
>>>>>> Cloudera. For some reason, Cloudera defaults to Spark 1.6 and so does the
>>>>>> calls to spark-shell and spark-submit. To force the use of Spark 2.x, the
>>>>>> calls need to be spark2-shell and spark2-submit. I wonder if this is
>>>>>> causing the problem. By the way, we are using Java8 corporate wide, and
>>>>>> there seems to be no problems using Zeppelin.
>>>>>>
>>>>>> Cheers,
>>>>>> Ben
>>>>>>
>>>>>> On Tue, Aug 1, 2017 at 7:05 PM Ruslan Dautkhanov <
>>>>>> dautkha...@gmail.com> wrote:
>>>>>>
>>>>>>> Might need to recompile Zeppelin with Scala 2.11?
>>>>>>> Also Spark 2.2 now requires JDK8 I believe.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ruslan Dautkhanov
>>>>>>>
>>>>>>> On Tue, Aug 1, 2017 at 6:26 PM, Benjamin Kim <bbuil...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Here is more.
>>>>>>>>
>>>>>>>> org.apache.zeppelin.interpreter.InterpreterException: WARNING:
>>>>>>>> User-defined SPARK_HOME
>>>>>>>> (/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2)
>>>>>>>> overrides detected (/opt/cloudera/parcels/SPARK2/lib/spark2).
>>>>>>>> WARNING: Running spark-class from user-defined location.
>>>>>>>> Exception

Re: Cloudera Spark 2.2

2017-08-02 Thread Benjamin Kim
Did you build Zeppelin or download the binary?

On Wed, Aug 2, 2017 at 3:40 PM Ruslan Dautkhanov <dautkha...@gmail.com>
wrote:

> We're using an ~April snapshot of Zeppelin, so not sure about 0.7.1.
>
> Yes, we have that spark home in zeppelin-env.sh
>
>
>
> --
> Ruslan Dautkhanov
>
> On Wed, Aug 2, 2017 at 4:31 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> Does this work with Zeppelin 0.7.1? We get an error when setting SPARK_HOME
>> in zeppelin-env.sh to what you have below.
>>
>> On Wed, Aug 2, 2017 at 3:24 PM Ruslan Dautkhanov <dautkha...@gmail.com>
>> wrote:
>>
>>> You don't have to use spark2-shell and spark2-submit to use Spark 2.
>>> That can be controled by setting SPARK_HOME using regular
>>> spark-submit/spark-shell.
>>>
>>> $ which spark-submit
>>> /usr/bin/spark-submit
>>> $ which spark-shell
>>> /usr/bin/spark-shell
>>>
>>> $ spark-shell
>>> Welcome to
>>>     __
>>>  / __/__  ___ _/ /__
>>> _\ \/ _ \/ _ `/ __/  '_/
>>>/___/ .__/\_,_/_/ /_/\_\   version 1.6.0
>>>   /_/
>>>
>>>
>>>
>>> $ export SPARK_HOME=/opt/cloudera/parcels/SPARK2/lib/spark2
>>>
>>> $ spark-shell
>>> Welcome to
>>>         __
>>>  / __/__  ___ _/ /__
>>> _\ \/ _ \/ _ `/ __/  '_/
>>>/___/ .__/\_,_/_/ /_/\_\   version 2.1.0.cloudera1
>>>   /_/
>>>
>>>
>>> spark-submit and spark-shell are just shell script wrappers.
>>>
>>>
>>>
>>> --
>>> Ruslan Dautkhanov
>>>
>>> On Wed, Aug 2, 2017 at 10:22 AM, Benjamin Kim <bbuil...@gmail.com>
>>> wrote:
>>>
>>>> According to the Zeppelin documentation, Zeppelin 0.7.1 supports Spark
>>>> 2.1. But, I don't know if it supports Spark 2.2 or even 2.1 from Cloudera.
>>>> For some reason, Cloudera defaults to Spark 1.6 and so does the calls to
>>>> spark-shell and spark-submit. To force the use of Spark 2.x, the calls need
>>>> to be spark2-shell and spark2-submit. I wonder if this is causing the
>>>> problem. By the way, we are using Java8 corporate wide, and there seems to
>>>> be no problems using Zeppelin.
>>>>
>>>> Cheers,
>>>> Ben
>>>>
>>>> On Tue, Aug 1, 2017 at 7:05 PM Ruslan Dautkhanov <dautkha...@gmail.com>
>>>> wrote:
>>>>
>>>>> Might need to recompile Zeppelin with Scala 2.11?
>>>>> Also Spark 2.2 now requires JDK8 I believe.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ruslan Dautkhanov
>>>>>
>>>>> On Tue, Aug 1, 2017 at 6:26 PM, Benjamin Kim <bbuil...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Here is more.
>>>>>>
>>>>>> org.apache.zeppelin.interpreter.InterpreterException: WARNING:
>>>>>> User-defined SPARK_HOME
>>>>>> (/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2)
>>>>>> overrides detected (/opt/cloudera/parcels/SPARK2/lib/spark2).
>>>>>> WARNING: Running spark-class from user-defined location.
>>>>>> Exception in thread "main" java.lang.NoSuchMethodError:
>>>>>> scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
>>>>>> at
>>>>>> org.apache.spark.util.Utils$.getDefaultPropertiesFile(Utils.scala:2103)
>>>>>> at
>>>>>> org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:124)
>>>>>> at
>>>>>> org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:124)
>>>>>> at scala.Option.getOrElse(Option.scala:120)
>>>>>> at
>>>>>> org.apache.spark.deploy.SparkSubmitArguments.mergeDefaultSparkProperties(SparkSubmitArguments.scala:124)
>>>>>> at
>>>>>> org.apache.spark.deploy.SparkSubmitArguments.(SparkSubmitArguments.scala:110)
>>>>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
>>>>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>>>>
>>>>>> Cheers,
>>>>>> Ben
>>>>>>
>>>

Re: Cloudera Spark 2.2

2017-08-02 Thread Benjamin Kim
Does this work with Zeppelin 0.7.1? We get an error when setting SPARK_HOME in
zeppelin-env.sh to what you have below.

On Wed, Aug 2, 2017 at 3:24 PM Ruslan Dautkhanov <dautkha...@gmail.com>
wrote:

> You don't have to use spark2-shell and spark2-submit to use Spark 2.
> That can be controled by setting SPARK_HOME using regular
> spark-submit/spark-shell.
>
> $ which spark-submit
> /usr/bin/spark-submit
> $ which spark-shell
> /usr/bin/spark-shell
>
> $ spark-shell
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.6.0
>   /_/
>
>
>
> $ export SPARK_HOME=/opt/cloudera/parcels/SPARK2/lib/spark2
>
> $ spark-shell
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.1.0.cloudera1
>   /_/
>
>
> spark-submit and spark-shell are just shell script wrappers.
>
>
>
> --
> Ruslan Dautkhanov
>
> On Wed, Aug 2, 2017 at 10:22 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> According to the Zeppelin documentation, Zeppelin 0.7.1 supports Spark
>> 2.1. But, I don't know if it supports Spark 2.2 or even 2.1 from Cloudera.
>> For some reason, Cloudera defaults to Spark 1.6 and so does the calls to
>> spark-shell and spark-submit. To force the use of Spark 2.x, the calls need
>> to be spark2-shell and spark2-submit. I wonder if this is causing the
>> problem. By the way, we are using Java8 corporate wide, and there seems to
>> be no problems using Zeppelin.
>>
>> Cheers,
>> Ben
>>
>> On Tue, Aug 1, 2017 at 7:05 PM Ruslan Dautkhanov <dautkha...@gmail.com>
>> wrote:
>>
>>> Might need to recompile Zeppelin with Scala 2.11?
>>> Also Spark 2.2 now requires JDK8 I believe.
>>>
>>>
>>>
>>> --
>>> Ruslan Dautkhanov
>>>
>>> On Tue, Aug 1, 2017 at 6:26 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>
>>>> Here is more.
>>>>
>>>> org.apache.zeppelin.interpreter.InterpreterException: WARNING:
>>>> User-defined SPARK_HOME
>>>> (/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2)
>>>> overrides detected (/opt/cloudera/parcels/SPARK2/lib/spark2).
>>>> WARNING: Running spark-class from user-defined location.
>>>> Exception in thread "main" java.lang.NoSuchMethodError:
>>>> scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
>>>> at
>>>> org.apache.spark.util.Utils$.getDefaultPropertiesFile(Utils.scala:2103)
>>>> at
>>>> org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:124)
>>>> at
>>>> org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:124)
>>>> at scala.Option.getOrElse(Option.scala:120)
>>>> at
>>>> org.apache.spark.deploy.SparkSubmitArguments.mergeDefaultSparkProperties(SparkSubmitArguments.scala:124)
>>>> at
>>>> org.apache.spark.deploy.SparkSubmitArguments.(SparkSubmitArguments.scala:110)
>>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
>>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>>
>>>> Cheers,
>>>> Ben
>>>>
>>>>
>>>> On Tue, Aug 1, 2017 at 5:24 PM Jeff Zhang <zjf...@gmail.com> wrote:
>>>>
>>>>>
>>>>> Then it is due to some classpath issue. I am not sure familiar with
>>>>> CDH, please check whether spark of CDH include hadoop jar with it.
>>>>>
>>>>>
>>>>> Benjamin Kim <bbuil...@gmail.com> wrote on Wednesday, August 2, 2017 at 8:22 AM:
>>>>>
>>>>>> Here is the error that was sent to me.
>>>>>>
>>>>>> org.apache.zeppelin.interpreter.InterpreterException: Exception in
>>>>>> thread "main" java.lang.NoClassDefFoundError:
>>>>>> org/apache/hadoop/fs/FSDataInputStream
>>>>>> Caused by: java.lang.ClassNotFoundException:
>>>>>> org.apache.hadoop.fs.FSDataInputStream
>>>>>>
>>>>>> Cheers,
>>>>>> Ben
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 1, 2017 at 5:20 PM Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> By default, 0.7.1 doesn't support spark 2.2. But you can set 
>>>>>>> zeppelin.spark.enableSupportedVersionCheck
>>>>>>> in interpreter setting to disable the supported version check.
>>>>>>>
>>>>>>>
>>>>>>> Jeff Zhang <zjf...@gmail.com> wrote on Wednesday, August 2, 2017 at 8:18 AM:
>>>>>>>
>>>>>>>>
>>>>>>>> What's the error you see in log ?
>>>>>>>>
>>>>>>>>
>>>>>>>> Benjamin Kim <bbuil...@gmail.com> wrote on Wednesday, August 2, 2017 at 8:18 AM:
>>>>>>>>
>>>>>>>>> Has anyone configured Zeppelin 0.7.1 for Cloudera's release of
>>>>>>>>> Spark 2.2? I can't get it to work. I downloaded the binary and set
>>>>>>>>> SPARK_HOME to /opt/cloudera/parcels/SPARK2/lib/spark2. I must be 
>>>>>>>>> missing
>>>>>>>>> something.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Ben
>>>>>>>>>
>>>>>>>>
>>>
>


Geo Map Charting

2017-08-02 Thread Benjamin Kim
Anyone every try to chart density clusters or heat maps onto a geo map of
the earth in Zeppelin? Can it be done?

Cheers,
Ben


Re: Cloudera Spark 2.2

2017-08-02 Thread Benjamin Kim
According to the Zeppelin documentation, Zeppelin 0.7.1 supports Spark 2.1.
But, I don't know if it supports Spark 2.2 or even 2.1 from Cloudera. For
some reason, Cloudera defaults to Spark 1.6, and so do the calls to
spark-shell and spark-submit. To force the use of Spark 2.x, the calls need
to be spark2-shell and spark2-submit. I wonder if this is causing the
problem. By the way, we are using Java8 corporate wide, and there seems to
be no problems using Zeppelin.

Cheers,
Ben

On Tue, Aug 1, 2017 at 7:05 PM Ruslan Dautkhanov <dautkha...@gmail.com>
wrote:

> Might need to recompile Zeppelin with Scala 2.11?
> Also Spark 2.2 now requires JDK8 I believe.
>
>
>
> --
> Ruslan Dautkhanov
>
> On Tue, Aug 1, 2017 at 6:26 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> Here is more.
>>
>> org.apache.zeppelin.interpreter.InterpreterException: WARNING:
>> User-defined SPARK_HOME
>> (/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2)
>> overrides detected (/opt/cloudera/parcels/SPARK2/lib/spark2).
>> WARNING: Running spark-class from user-defined location.
>> Exception in thread "main" java.lang.NoSuchMethodError:
>> scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
>> at org.apache.spark.util.Utils$.getDefaultPropertiesFile(Utils.scala:2103)
>> at
>> org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:124)
>> at
>> org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:124)
>> at scala.Option.getOrElse(Option.scala:120)
>> at
>> org.apache.spark.deploy.SparkSubmitArguments.mergeDefaultSparkProperties(SparkSubmitArguments.scala:124)
>> at
>> org.apache.spark.deploy.SparkSubmitArguments.(SparkSubmitArguments.scala:110)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>> Cheers,
>> Ben
>>
>>
>> On Tue, Aug 1, 2017 at 5:24 PM Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>>
>>> Then it is due to some classpath issue. I am not sure familiar with CDH,
>>> please check whether spark of CDH include hadoop jar with it.
>>>
>>>
>>> Benjamin Kim <bbuil...@gmail.com> wrote on Wednesday, August 2, 2017 at 8:22 AM:
>>>
>>>> Here is the error that was sent to me.
>>>>
>>>> org.apache.zeppelin.interpreter.InterpreterException: Exception in
>>>> thread "main" java.lang.NoClassDefFoundError:
>>>> org/apache/hadoop/fs/FSDataInputStream
>>>> Caused by: java.lang.ClassNotFoundException:
>>>> org.apache.hadoop.fs.FSDataInputStream
>>>>
>>>> Cheers,
>>>> Ben
>>>>
>>>>
>>>> On Tue, Aug 1, 2017 at 5:20 PM Jeff Zhang <zjf...@gmail.com> wrote:
>>>>
>>>>>
>>>>> By default, 0.7.1 doesn't support spark 2.2. But you can set 
>>>>> zeppelin.spark.enableSupportedVersionCheck
>>>>> in interpreter setting to disable the supported version check.
>>>>>
>>>>>
>>>>> Jeff Zhang <zjf...@gmail.com> wrote on Wednesday, August 2, 2017 at 8:18 AM:
>>>>>
>>>>>>
>>>>>> What's the error you see in log ?
>>>>>>
>>>>>>
>>>>>> Benjamin Kim <bbuil...@gmail.com> wrote on Wednesday, August 2, 2017 at 8:18 AM:
>>>>>>
>>>>>>> Has anyone configured Zeppelin 0.7.1 for Cloudera's release of Spark
>>>>>>> 2.2? I can't get it to work. I downloaded the binary and set SPARK_HOME 
>>>>>>> to
>>>>>>> /opt/cloudera/parcels/SPARK2/lib/spark2. I must be missing something.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Ben
>>>>>>>
>>>>>>
>


Re: Cloudera Spark 2.2

2017-08-01 Thread Benjamin Kim
Here is more.

org.apache.zeppelin.interpreter.InterpreterException: WARNING: User-defined
SPARK_HOME
(/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2)
overrides detected (/opt/cloudera/parcels/SPARK2/lib/spark2).
WARNING: Running spark-class from user-defined location.
Exception in thread "main" java.lang.NoSuchMethodError:
scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
at org.apache.spark.util.Utils$.getDefaultPropertiesFile(Utils.scala:2103)
at
org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:124)
at
org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:124)
at scala.Option.getOrElse(Option.scala:120)
at
org.apache.spark.deploy.SparkSubmitArguments.mergeDefaultSparkProperties(SparkSubmitArguments.scala:124)
at
org.apache.spark.deploy.SparkSubmitArguments.(SparkSubmitArguments.scala:110)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Cheers,
Ben


On Tue, Aug 1, 2017 at 5:24 PM Jeff Zhang <zjf...@gmail.com> wrote:

>
> Then it is due to some classpath issue. I am not sure familiar with CDH,
> please check whether spark of CDH include hadoop jar with it.
>
>
> Benjamin Kim <bbuil...@gmail.com> wrote on Wednesday, August 2, 2017 at 8:22 AM:
>
>> Here is the error that was sent to me.
>>
>> org.apache.zeppelin.interpreter.InterpreterException: Exception in thread
>> "main" java.lang.NoClassDefFoundError:
>> org/apache/hadoop/fs/FSDataInputStream
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.hadoop.fs.FSDataInputStream
>>
>> Cheers,
>> Ben
>>
>>
>> On Tue, Aug 1, 2017 at 5:20 PM Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>>
>>> By default, 0.7.1 doesn't support spark 2.2. But you can set 
>>> zeppelin.spark.enableSupportedVersionCheck
>>> in interpreter setting to disable the supported version check.
>>>
>>>
>>> Jeff Zhang <zjf...@gmail.com> wrote on Wednesday, August 2, 2017 at 8:18 AM:
>>>
>>>>
>>>> What's the error you see in log ?
>>>>
>>>>
>>>> Benjamin Kim <bbuil...@gmail.com> wrote on Wednesday, August 2, 2017 at 8:18 AM:
>>>>
>>>>> Has anyone configured Zeppelin 0.7.1 for Cloudera's release of Spark
>>>>> 2.2? I can't get it to work. I downloaded the binary and set SPARK_HOME to
>>>>> /opt/cloudera/parcels/SPARK2/lib/spark2. I must be missing something.
>>>>>
>>>>> Cheers,
>>>>> Ben
>>>>>
>>>>


Re: Cloudera Spark 2.2

2017-08-01 Thread Benjamin Kim
Here is the error that was sent to me.

org.apache.zeppelin.interpreter.InterpreterException: Exception in thread
"main" java.lang.NoClassDefFoundError:
org/apache/hadoop/fs/FSDataInputStream
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.fs.FSDataInputStream

Cheers,
Ben


On Tue, Aug 1, 2017 at 5:20 PM Jeff Zhang <zjf...@gmail.com> wrote:

>
> By default, 0.7.1 doesn't support spark 2.2. But you can set 
> zeppelin.spark.enableSupportedVersionCheck
> in interpreter setting to disable the supported version check.
>
>
> Jeff Zhang <zjf...@gmail.com> wrote on Wednesday, August 2, 2017 at 8:18 AM:
>
>>
>> What's the error you see in log ?
>>
>>
>> Benjamin Kim <bbuil...@gmail.com> wrote on Wednesday, August 2, 2017 at 8:18 AM:
>>
>>> Has anyone configured Zeppelin 0.7.1 for Cloudera's release of Spark
>>> 2.2? I can't get it to work. I downloaded the binary and set SPARK_HOME to
>>> /opt/cloudera/parcels/SPARK2/lib/spark2. I must be missing something.
>>>
>>> Cheers,
>>> Ben
>>>
>>


Cloudera Spark 2.2

2017-08-01 Thread Benjamin Kim
Has anyone configured Zeppelin 0.7.1 for Cloudera's release of Spark 2.2? I
can't get it to work. I downloaded the binary and set SPARK_HOME to
/opt/cloudera/parcels/SPARK2/lib/spark2. I must be missing something.

Cheers,
Ben


Glue-like Functionality

2017-07-08 Thread Benjamin Kim
Has anyone seen AWS Glue? I was wondering if there is something similar going 
to be built into Spark Structured Streaming? I like the Data Catalog idea to 
store and track any data source/destination. It profiles the data to derive the 
schema and data types. Also, it does some sort of automated schema evolution 
when or if the schema changes. It leaves only the transformation logic to the 
ETL developer. I think some of this can enhance or simplify Structured 
Streaming. For example, AWS S3 can be catalogued as a Data Source; in 
Structured Streaming, Input DataFrame is created like a SQL view based off of 
the S3 Data Source; lastly, the Transform logic, if any, just manipulates the 
data going from the Input DataFrame to the Result DataFrame, which is another 
view based off of a catalogued Data Destination. This would relieve the ETL 
developer from caring about any Data Source or Destination. All server 
information, access credentials, data schemas, folder directory structures, 
file formats, and any other properties can be securely stored away with only a 
select few.
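Concretely, the plumbing such a catalog would hide away looks roughly like this
in today's Structured Streaming (the bucket names, schema, and formats below are
purely illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("catalog_sketch").getOrCreate()

# Everything here is what a catalogued Data Source entry would capture:
# location, format, and schema.
source_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event", StringType()),
    StructField("ts", LongType()),
])
events = (spark.readStream
               .schema(source_schema)
               .json("s3a://example-bucket/incoming/"))

# The transformation logic is all the ETL developer would actually write.
result = events.where(events.event == "click")

# And this is what a catalogued Data Destination would capture: location,
# format, checkpointing, and output mode.
query = (result.writeStream
               .format("parquet")
               .option("checkpointLocation", "s3a://example-bucket/checkpoints/clicks/")
               .outputMode("append")
               .start("s3a://example-bucket/clicks/"))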

I'm just curious to know if anyone has thought the same thing.

Cheers,
Ben



Centos 7 Compatibility

2017-06-21 Thread Benjamin Kim
All,

I’m curious to know if Zeppelin will work with CentOS 7. I don’t see it in the 
list of OS’s supported.

Thanks,
Ben

Re: Use SQL Script to Write Spark SQL Jobs

2017-06-12 Thread Benjamin Kim
Hi Bo,

+1 for your project. I come from the world of data warehouses, ETL, and 
reporting analytics. There are many individuals who do not know or want to do 
any coding. They are content with ANSI SQL and stick to it. ETL workflows are 
also done without any coding using a drag-and-drop user interface, such as 
Talend, SSIS, etc. There is a small amount of scripting involved but not too 
much. I looked at what you are trying to do, and I welcome it. This could open 
up Spark to the masses and shorten development times.

Cheers,
Ben


> On Jun 12, 2017, at 10:14 PM, bo yang  wrote:
> 
> Hi Aakash,
> 
> Thanks for your willing to help :) It will be great if I could get more 
> feedback on my project. For example, is there any other people feeling the 
> need of using a script to write Spark job easily? Also, I would explore 
> whether it is possible that the Spark project takes some work to build such a 
> script based high level DSL.
> 
> Best,
> Bo
> 
> 
> On Mon, Jun 12, 2017 at 12:14 PM, Aakash Basu  > wrote:
> Hey,
> 
> I work on Spark SQL and would pretty much be able to help you in this. Let me 
> know your requirement.
> 
> Thanks,
> Aakash.
> 
> On 12-Jun-2017 11:00 AM, "bo yang"  > wrote:
> Hi Guys,
> 
> I am writing a small open source project 
>  to use SQL Script to write Spark 
> Jobs. Want to see if there are other people interested to use or contribute 
> to this project.
> 
> The project is called UberScriptQuery 
> (https://github.com/uber/uberscriptquery 
> ). Sorry for the dumb name to avoid 
> conflict with many other names (Spark is registered trademark, thus I could 
> not use Spark in my project name).
> 
> In short, it is a high level SQL-like DSL (Domain Specific Language) on top 
> of Spark. People can use that DSL to write Spark jobs without worrying about 
> Spark internal details. Please check README 
>  in the project to get more details.
> 
> It will be great if I could get any feedback or suggestions!
> 
> Best,
> Bo
> 
> 



Re: Spark 2.1 and Hive Metastore

2017-04-09 Thread Benjamin Kim
Dan,

Yes, you’re correct. I sent it to the wrong users’ group.

Thanks,
Ben


> On Apr 9, 2017, at 1:21 PM, Dan Burkert <danburk...@apache.org> wrote:
> 
> Hi Ben,
> 
> Was this meant for the Spark user list, or is there something specific to the 
> Spark/Kudu integration you are asking about?
> 
> - Dan
> 
> On Sun, Apr 9, 2017 at 11:13 AM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> I’m curious about if and when Spark SQL will ever remove its dependency on 
> Hive Metastore. Now that Spark 2.1’s SparkSession has superseded the need for 
> HiveContext, are there plans for Spark to no longer use the Hive Metastore 
> service with a “SparkSchema” service with a PostgreSQL, MySQL, etc. DB 
> backend? Hive is growing long in the tooth, and it would be nice to retire it 
> someday.
> 
> Cheers,
> Ben
> 



Spark 2.1 and Hive Metastore

2017-04-09 Thread Benjamin Kim
I’m curious about if and when Spark SQL will ever remove its dependency on Hive 
Metastore. Now that Spark 2.1’s SparkSession has superseded the need for 
HiveContext, are there plans for Spark to no longer use the Hive Metastore 
service with a “SparkSchema” service with a PostgreSQL, MySQL, etc. DB backend? 
Hive is growing long in the tooth, and it would be nice to retire it someday.

Cheers,
Ben
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark 2.1 and Hive Metastore

2017-04-09 Thread Benjamin Kim
I’m curious about if and when Spark SQL will ever remove its dependency on Hive 
Metastore. Now that Spark 2.1’s SparkSession has superseded the need for 
HiveContext, are there plans for Spark to no longer use the Hive Metastore 
service with a “SparkSchema” service with a PostgreSQL, MySQL, etc. DB backend? 
Hive is growing long in the tooth, and it would be nice to retire it someday.

Cheers,
Ben

Re: Spark on Kudu Roadmap

2017-04-09 Thread Benjamin Kim
Hi Mike,

Thanks for the link. I guess further, deeper Spark integration is slowly 
coming. But when, we will have to wait and see.

Cheers,
Ben
 

> On Mar 27, 2017, at 12:25 PM, Mike Percy <mpe...@apache.org> wrote:
> 
> Hi Ben,
> I don't really know so I'll let someone else more familiar with the Spark 
> integration chime in on that. However I searched the Kudu JIRA and I don't 
> see a tracking ticket filed on this (the closest thing I could find was 
> https://issues.apache.org/jira/browse/KUDU-1676 
> <https://issues.apache.org/jira/browse/KUDU-1676> ) so you may want to file a 
> JIRA to help track this feature.
> 
> Mike
> 
> 
> On Mon, Mar 27, 2017 at 11:55 AM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Hi Mike,
> 
> I believe what we are looking for is this below. It is an often request use 
> case.
> 
> Anyone know if the Spark package will ever allow for creating tables in Spark 
> SQL?
> 
> Such as:
>CREATE EXTERNAL TABLE 
>USING org.apache.kudu.spark.kudu
>OPTIONS (Map("kudu.master" -> “", "kudu.table" -> 
> “table-name”));
> 
> In this way, plain SQL can be used to do DDL, DML statements whether in Spark 
> SQL code or using JDBC to interface with Spark SQL Thriftserver.
> 
> Thanks,
> Ben
> 
> 
> 
>> On Mar 27, 2017, at 11:01 AM, Mike Percy <mpe...@apache.org 
>> <mailto:mpe...@apache.org>> wrote:
>> 
>> Hi Ben,
>> Is there anything in particular you are looking for?
>> 
>> Thanks,
>> Mike
>> 
>> On Mon, Mar 27, 2017 at 9:48 AM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Hi,
>> 
>> Are there any plans for deeper integration with Spark especially Spark SQL? 
>> Is there a roadmap to look at, so I can know what to expect in the future?
>> 
>> Cheers,
>> Ben
>> 
> 
> 



Re: Spark on Kudu Roadmap

2017-03-27 Thread Benjamin Kim
Hi Mike,

I believe what we are looking for is this below. It is an often-requested use 
case.

Anyone know if the Spark package will ever allow for creating tables in Spark 
SQL?

Such as:
   CREATE EXTERNAL TABLE <table-name>
   USING org.apache.kudu.spark.kudu
   OPTIONS (Map("kudu.master" -> "<kudu-master-addresses>", "kudu.table" -> 
"<table-name>"));

In this way, plain SQL can be used to do DDL, DML statements whether in Spark 
SQL code or using JDBC to interface with Spark SQL Thriftserver.

Thanks,
Ben
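
In the meantime, a minimal sketch of reaching the same table from the DataFrame 
side with the kudu-spark data source (Spark 1.6-style SQLContext; the master 
address and table name are placeholders):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)  // sc: existing SparkContext, e.g. in spark-shell

// Read a Kudu table as a DataFrame; "kudu-master:7051" and "table-name" are placeholders.
val df = sqlContext.read
  .options(Map("kudu.master" -> "kudu-master:7051", "kudu.table" -> "table-name"))
  .format("org.apache.kudu.spark.kudu")
  .load()

// Once registered, the table is reachable from Spark SQL (and the Thriftserver via JDBC).
df.registerTempTable("kudu_table")
sqlContext.sql("SELECT COUNT(*) FROM kudu_table").show()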


> On Mar 27, 2017, at 11:01 AM, Mike Percy <mpe...@apache.org> wrote:
> 
> Hi Ben,
> Is there anything in particular you are looking for?
> 
> Thanks,
> Mike
> 
> On Mon, Mar 27, 2017 at 9:48 AM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Hi,
> 
> Are there any plans for deeper integration with Spark especially Spark SQL? 
> Is there a roadmap to look at, so I can know what to expect in the future?
> 
> Cheers,
> Ben
> 



Spark on Kudu Roadmap

2017-03-27 Thread Benjamin Kim
Hi,

Are there any plans for deeper integration with Spark especially Spark SQL? Is 
there a roadmap to look at, so I can know what to expect in the future?

Cheers,
Ben

Re: Kudu on top of Alluxio

2017-03-25 Thread Benjamin Kim
Mike,

Thanks for the informative answer. I asked this question because I saw that 
Alluxio can be used to handle storage for HBase. Plus, we could keep our 
cluster size to a minimum and not need to add more nodes based on storage 
capacity. We would only need to size our clusters based on load (cores, memory, 
bandwidth) instead.

Cheers,
Ben


> On Mar 25, 2017, at 2:54 PM, Mike Percy <mpe...@apache.org> wrote:
> 
> Kudu currently relies on local storage on a POSIX file system. Right now 
> there is no support for S3, which would be interesting but is non-trivial in 
> certain ways (particularly if we wanted to rely on S3's replication and 
> disable Kudu's app-level replication).
> 
> I would suggest using only either EXT4 or XFS file systems for production 
> deployments as of Kudu 1.3, in a JBOD configuration, with one SSD per machine 
> for the WAL and with the data disks on either SATA or SSD drives depending on 
> the workload. Anything else is untested AFAIK.
> 
> As for Alluxio, I haven't heard of people using it for permanent storage and 
> since Kudu has its own block cache I don't think it would really help with 
> caching. Also I don't recall Tachyon providing POSIX semantics.
> 
> Mike
> 
> Sent from my iPhone
> 
>> On Mar 25, 2017, at 9:50 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>> 
>> Hi,
>> 
>> Does anyone know of a way to use AWS S3 or 
> 



Kudu on top of Alluxio

2017-03-25 Thread Benjamin Kim
Hi,

Does anyone know of a way to use AWS S3 or


Security Roadmap

2017-03-18 Thread Benjamin Kim
I’m curious as to what security features we can expect coming in the near and 
far future for Kudu. If there is some documentation for this, please let me 
know.

Cheers,
Ben



Login/Logout Problem

2017-03-01 Thread Benjamin Kim
We are running into problems with users logging in and staying logged in. When they 
try to run JDBC queries or even open a notebook, they get flickering in the 
browser where the green color dot next to the username turns red, then back to 
green, then back to red, etc. When it stops doing that, then users are able to 
use the notebook finally, but when they do, they get an ERROR when clicking the 
run arrow often. Since these notebooks were migrated from Zeppelin 0.6 to 
Zeppelin 0.7, I suspect that there might be incompatibility issues.

Thanks,
Ben

Zeppelin Service Install

2017-03-01 Thread Benjamin Kim
Has anyone installed Zeppelin on a CentOS/RedHat server and made it into a 
service? I can’t seem to find the instructions on how to do this.

Cheers,
Ben

Re: Get S3 Parquet File

2017-02-24 Thread Benjamin Kim
Gourav,

I’ll start experimenting with Spark 2.1 to see if this works.

Cheers,
Ben


> On Feb 24, 2017, at 5:46 AM, Gourav Sengupta <gourav.sengu...@gmail.com> 
> wrote:
> 
> Hi Benjamin,
> 
> First of all fetching data from S3 while writing a code in on premise system 
> is a very bad idea. You might want to first copy the data in to local HDFS 
> before running your code. Ofcourse this depends on the volume of data and 
> internet speed that you have.
> 
> The platform which makes your data at least 10 times faster is SPARK 2.1. And 
> trust me you do not want to be writing code which needs you to update it once 
> again in 6 months because newer versions of SPARK now find it deprecated.
> 
> 
> Regards,
> Gourav Sengupta
> 
> 
> 
> On Fri, Feb 24, 2017 at 7:18 AM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Hi Gourav,
> 
> My answers are below.
> 
> Cheers,
> Ben
> 
> 
>> On Feb 23, 2017, at 10:57 PM, Gourav Sengupta <gourav.sengu...@gmail.com 
>> <mailto:gourav.sengu...@gmail.com>> wrote:
>> 
>> Can I ask where are you running your CDH? Is it on premise or have you 
>> created a cluster for yourself in AWS? Our cluster in on premise in our data 
>> center.
>> 
>> Also I have really never seen use s3a before, that was used way long before 
>> when writing s3 files took a long time, but I think that you are reading it. 
>> 
>> Anyideas why you are not migrating to Spark 2.1, besides speed, there are 
>> lots of apis which are new and the existing ones are being deprecated. 
>> Therefore there is a very high chance that you are already working on code 
>> which is being deprecated by the SPARK community right now. We use CDH and 
>> upgrade with whatever Spark version they include, which is 1.6.0. We are 
>> waiting for the move to Spark 2.0/2.1.
>> 
>> And besides that would you not want to work on a platform which is at least 
>> 10 times faster What would that be?
>> 
>> Regards,
>> Gourav Sengupta
>> 
>> On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet 
>> file from AWS S3. We can read the schema and show some data when the file is 
>> loaded into a DataFrame, but when we try to do some operations, such as 
>> count, we get this error below.
>> 
>> com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS 
>> credentials from any provider in the chain
>> at 
>> com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>> at 
>> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)
>> at 
>> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
>> at 
>> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
>> at 
>> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)
>> at 
>> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)
>> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
>> at 
>> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)
>> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)
>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
>> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>> at 
>> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
>> at 
>> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
>> at 
>> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
>> at 
>> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:180)
>> at 
>> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>

Re: Get S3 Parquet File

2017-02-23 Thread Benjamin Kim
Hi Gourav,

My answers are below.

Cheers,
Ben


> On Feb 23, 2017, at 10:57 PM, Gourav Sengupta <gourav.sengu...@gmail.com> 
> wrote:
> 
> Can I ask where are you running your CDH? Is it on premise or have you 
> created a cluster for yourself in AWS? Our cluster in on premise in our data 
> center.
> 
> Also I have really never seen use s3a before, that was used way long before 
> when writing s3 files took a long time, but I think that you are reading it. 
> 
> Anyideas why you are not migrating to Spark 2.1, besides speed, there are 
> lots of apis which are new and the existing ones are being deprecated. 
> Therefore there is a very high chance that you are already working on code 
> which is being deprecated by the SPARK community right now. We use CDH and 
> upgrade with whatever Spark version they include, which is 1.6.0. We are 
> waiting for the move to Spark 2.0/2.1.
> 
> And besides that would you not want to work on a platform which is at least 
> 10 times faster What would that be?
> 
> Regards,
> Gourav Sengupta
> 
> On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet 
> file from AWS S3. We can read the schema and show some data when the file is 
> loaded into a DataFrame, but when we try to do some operations, such as 
> count, we get this error below.
> 
> com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS 
> credentials from any provider in the chain
> at 
> com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
> at 
> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)
> at 
> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
> at 
> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
> at 
> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
> at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
> at 
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:180)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 
> Can anyone help?
> 
> Cheers,
> Ben
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> 
> 



Re: Get S3 Parquet File

2017-02-23 Thread Benjamin Kim
Aakash,

Here is a code snippet for the keys.

val accessKey = "---"
val secretKey = "---"

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", accessKey)
hadoopConf.set("fs.s3a.secret.key", secretKey)
hadoopConf.set("spark.hadoop.fs.s3a.access.key",accessKey)
hadoopConf.set("spark.hadoop.fs.s3a.secret.key",secretKey)

val df = 
sqlContext.read.parquet("s3a://aps.optus/uc2/BI_URL_DATA_HLY_20170201_09.PARQUET.gz")
df.show
df.count

When we do the count, then the error happens.

Thanks,
Ben
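
One variant worth trying, sketched on the assumption that the executors (and not 
just the driver) are missing the credentials once count() actually reads the file: 
set the keys on the SparkConf with the spark.hadoop. prefix before the context is 
created, instead of mutating sc.hadoopConfiguration afterwards. Key values and the 
path are placeholders.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val accessKey = "---"   // placeholder
val secretKey = "---"   // placeholder

// spark.hadoop.* settings are folded into the Hadoop Configuration used by every executor.
val conf = new SparkConf()
  .setAppName("s3a-parquet-read")
  .set("spark.hadoop.fs.s3a.access.key", accessKey)
  .set("spark.hadoop.fs.s3a.secret.key", secretKey)

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

val df = sqlContext.read.parquet("s3a://example-bucket/path/file.parquet")  // placeholder path
df.count()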


> On Feb 23, 2017, at 10:31 AM, Aakash Basu <aakash.spark@gmail.com> wrote:
> 
> Hey,
> 
> Please recheck your access key and secret key being used to fetch the parquet 
> file. It seems to be a credential error. Either mismatch/load. If load, then 
> first use it directly in code and see if the issue resolves, then it can be 
> hidden and read from Input Params.
> 
> Thanks,
> Aakash.
> 
> 
> On 23-Feb-2017 11:54 PM, "Benjamin Kim" <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet 
> file from AWS S3. We can read the schema and show some data when the file is 
> loaded into a DataFrame, but when we try to do some operations, such as 
> count, we get this error below.
> 
> com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS 
> credentials from any provider in the chain
> at 
> com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
> at 
> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)
> at 
> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
> at 
> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
> at 
> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
> at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
> at 
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:180)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 
> Can anyone help?
> 
> Cheers,
> Ben
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> 
> 



Get S3 Parquet File

2017-02-23 Thread Benjamin Kim
We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet 
file from AWS S3. We can read the schema and show some data when the file is 
loaded into a DataFrame, but when we try to do some operations, such as count, 
we get this error below.

com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS 
credentials from any provider in the chain
at 
com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
at 
com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)
at 
com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
at 
com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at 
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
at 
parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
at 
parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
at 
org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:180)
at 
org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Can anyone help?

Cheers,
Ben


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Parquet Gzipped Files

2017-02-14 Thread Benjamin Kim
Jörn,

I agree with you, but the vendor is a little difficult to work with. For now, I 
will try to decompress it from S3 and save it plainly into HDFS. If someone 
already has this example, please let me know.

Cheers,
Ben
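
A minimal sketch of that decompression step, using the Hadoop FileSystem API plus 
java.util.zip and assuming the fs.s3a.* credentials are already configured; both 
paths are placeholders:

import java.util.zip.GZIPInputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.IOUtils

val conf = new Configuration()
val src = new Path("s3a://vendor-bucket/delivery/data.parquet.gz")  // placeholder: gzipped Parquet from the vendor
val dst = new Path("hdfs:///landing/data.parquet")                  // placeholder: plain Parquet on HDFS

val in  = new GZIPInputStream(src.getFileSystem(conf).open(src))
val out = dst.getFileSystem(conf).create(dst, true)
try {
  // Stream-decompress the S3 object straight into HDFS; no local temp copy needed.
  IOUtils.copyBytes(in, out, conf, false)
} finally {
  in.close()
  out.close()
}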


> On Feb 13, 2017, at 9:50 AM, Jörn Franke <jornfra...@gmail.com> wrote:
> 
> Your vendor should use the parquet internal compression and not take a 
> parquet file and gzip it.
> 
>> On 13 Feb 2017, at 18:48, Benjamin Kim <bbuil...@gmail.com> wrote:
>> 
>> We are receiving files from an outside vendor who creates a Parquet data 
>> file and Gzips it before delivery. Does anyone know how to Gunzip the file 
>> in Spark and inject the Parquet data into a DataFrame? I thought using 
>> sc.textFile or sc.wholeTextFiles would automatically Gunzip the file, but 
>> I’m getting a decompression header error when trying to open the Parquet 
>> file.
>> 
>> Thanks,
>> Ben
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Parquet Gzipped Files

2017-02-13 Thread Benjamin Kim
We are receiving files from an outside vendor who creates a Parquet data file 
and Gzips it before delivery. Does anyone know how to Gunzip the file in Spark 
and inject the Parquet data into a DataFrame? I thought using sc.textFile or 
sc.wholeTextFiles would automatically Gunzip the file, but I’m getting a 
decompression header error when trying to open the Parquet file.

Thanks,
Ben
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Remove dependence on HDFS

2017-02-11 Thread Benjamin Kim
Has anyone got some advice on how to remove the reliance on HDFS for storing 
persistent data? We have an on-premise Spark cluster. It seems like a waste of 
resources to keep adding nodes only because of a lack of storage space. I would 
rather add more powerful nodes due to the lack of processing power at a less 
frequent rate, than add less powerful nodes at a more frequent rate just to 
handle the ever growing data. Can anyone point me in the right direction? Is 
Alluxio a good solution? S3? I would like to hear your thoughts.

Cheers,
Ben 
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: HBase Spark

2017-02-03 Thread Benjamin Kim
Asher,

I found a profile for Scala 2.11 and removed it. Now, it brings in Scala 2.10. I ran 
some code and got further. Now, I get this error below when I do a “df.show”.

java.lang.AbstractMethodError
at org.apache.spark.Logging$class.log(Logging.scala:50)
at 
org.apache.spark.sql.execution.datasources.hbase.HBaseFilter$.log(HBaseFilter.scala:122)
at 
org.apache.spark.sql.execution.datasources.hbase.HBaseFilter$.buildFilters(HBaseFilter.scala:125)
at 
org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD.getPartitions(HBaseTableScan.scala:59)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)

Thanks for all your help.

Cheers,
Ben


> On Feb 3, 2017, at 8:16 AM, Asher Krim <ak...@hubspot.com> wrote:
> 
> Did you check the actual maven dep tree? Something might be pulling in a 
> different version. Also, if you're seeing this locally, you might want to 
> check which version of the scala sdk your IDE is using
> 
> Asher Krim
> Senior Software Engineer
> 
> 
> On Thu, Feb 2, 2017 at 5:43 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Hi Asher,
> 
> I modified the pom to be the same Spark (1.6.0), HBase (1.2.0), and Java 
> (1.8) version as our installation. The Scala (2.10.5) version is already the 
> same as ours. But I’m still getting the same error. Can you think of anything 
> else?
> 
> Cheers,
> Ben
> 
> 
>> On Feb 2, 2017, at 11:06 AM, Asher Krim <ak...@hubspot.com 
>> <mailto:ak...@hubspot.com>> wrote:
>> 
>> Ben,
>> 
>> That looks like a scala version mismatch. Have you checked your dep tree?
>> 
>> Asher Krim
>> Senior Software Engineer
>> 
>> 
>> On Thu, Feb 2, 2017 at 1:28 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Elek,
>> 
>> Can you give me some sample code? I can’t get mine to work.
>> 
>> import org.apache.spark.sql.{SQLContext, _}
>> import org.apache.spark.sql.execution.datasources.hbase._
>> import org.apache.spark.{SparkConf, SparkContext}
>> 
>> def cat = s"""{
>> |"table":{"namespace":"ben", "name":"dmp_test", 
>> "tableCoder":"PrimitiveType"},
>> |"rowkey":"key",
>> |"columns":{
>> |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
>> |"col1":{"cf":"d", "col":"google_gid", "type":"string"}
>> |}
>> |}""".stripMargin
>> 
>> import sqlContext.implicits._
>> 
>> def withCatalog(cat: String): DataFrame = {
>> sqlContext
>> .read
>> .options(Map(HBaseTableCatalog.tableCatalog->cat))
>> .format("org.apache.spark.sql.execution.datasources.hbase")
>> .load()
>> }
>> 
>> val df = withCatalog(cat)
>> df.show
>> 
>> It gives me this error.
>> 
>> java.lang.NoSuchMethodError: 
>> scala.runtime.ObjectRef.create(Ljava/lang/Object;)Lscala/runtime/ObjectRef;
>>  at 
>> org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$.apply(HBaseTableCatalog.scala:232)
>>  at 
>> org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.(HBaseRelation.scala:77)
>>  at 
>> org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:51)
>>  at 
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
>>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>> 
>> If you can please help, I would be grateful.
>> 
>> Cheers,
>> Ben
>> 
>> 
>>> On Jan 31, 2017, at 1:02 PM, Marton, Elek <h...@anzix.net 
>>> <mailto:h...@anzix.net>> wrote:
>>> 
>>> 
>>> I tested this one with hbase 1.2.4:
>>> 
>>> https://github.com/hortonworks-spark/shc 
>>> <https://github.com/hortonworks-spark/shc>
>>> 
>>> Marton
>>> 
>>> On 01/31/2017 09:17 PM, Benjamin Kim wrote:
>>>> Does anyone know how to backport the HBase Spark module to HBase 1.2.0? I 
>>>> tried to build it from source, but I cannot get it to work.
>>>> 
>>>> Thanks,
>>>> Ben
>>>> -
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
>>>> <mailto:user-unsubscr...@spark.apache.org>
>>>> 
>>> 
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
>>> <mailto:user-unsubscr...@spark.apache.org>
>>> 
>> 
>> 
> 
> 



Re: HBase Spark

2017-02-03 Thread Benjamin Kim
Asher,

You’re right. I don’t see anything but 2.11 being pulled in. Do you know where 
I can change this?

Cheers,
Ben


> On Feb 3, 2017, at 10:50 AM, Asher Krim <ak...@hubspot.com> wrote:
> 
> Sorry for my persistence, but did you actually run "mvn dependency:tree 
> -Dverbose=true"? And did you see only scala 2.10.5 being pulled in?
> 
> On Fri, Feb 3, 2017 at 12:33 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Asher,
> 
> It’s still the same. Do you have any other ideas?
> 
> Cheers,
> Ben
> 
> 
>> On Feb 3, 2017, at 8:16 AM, Asher Krim <ak...@hubspot.com 
>> <mailto:ak...@hubspot.com>> wrote:
>> 
>> Did you check the actual maven dep tree? Something might be pulling in a 
>> different version. Also, if you're seeing this locally, you might want to 
>> check which version of the scala sdk your IDE is using
>> 
>> Asher Krim
>> Senior Software Engineer
>> 
>> 
>> On Thu, Feb 2, 2017 at 5:43 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Hi Asher,
>> 
>> I modified the pom to be the same Spark (1.6.0), HBase (1.2.0), and Java 
>> (1.8) version as our installation. The Scala (2.10.5) version is already the 
>> same as ours. But I’m still getting the same error. Can you think of 
>> anything else?
>> 
>> Cheers,
>> Ben
>> 
>> 
>>> On Feb 2, 2017, at 11:06 AM, Asher Krim <ak...@hubspot.com 
>>> <mailto:ak...@hubspot.com>> wrote:
>>> 
>>> Ben,
>>> 
>>> That looks like a scala version mismatch. Have you checked your dep tree?
>>> 
>>> Asher Krim
>>> Senior Software Engineer
>>> 
>>> 
>>> On Thu, Feb 2, 2017 at 1:28 PM, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> Elek,
>>> 
>>> Can you give me some sample code? I can’t get mine to work.
>>> 
>>> import org.apache.spark.sql.{SQLContext, _}
>>> import org.apache.spark.sql.execution.datasources.hbase._
>>> import org.apache.spark.{SparkConf, SparkContext}
>>> 
>>> def cat = s"""{
>>> |"table":{"namespace":"ben", "name":"dmp_test", 
>>> "tableCoder":"PrimitiveType"},
>>> |"rowkey":"key",
>>> |"columns":{
>>> |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
>>> |"col1":{"cf":"d", "col":"google_gid", "type":"string"}
>>> |}
>>> |}""".stripMargin
>>> 
>>> import sqlContext.implicits._
>>> 
>>> def withCatalog(cat: String): DataFrame = {
>>> sqlContext
>>> .read
>>> .options(Map(HBaseTableCatalog.tableCatalog->cat))
>>> .format("org.apache.spark.sql.execution.datasources.hbase")
>>> .load()
>>> }
>>> 
>>> val df = withCatalog(cat)
>>> df.show
>>> 
>>> It gives me this error.
>>> 
>>> java.lang.NoSuchMethodError: 
>>> scala.runtime.ObjectRef.create(Ljava/lang/Object;)Lscala/runtime/ObjectRef;
>>> at 
>>> org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$.apply(HBaseTableCatalog.scala:232)
>>> at 
>>> org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.(HBaseRelation.scala:77)
>>> at 
>>> org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:51)
>>> at 
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>>> 
>>> If you can please help, I would be grateful.
>>> 
>>> Cheers,
>>> Ben
>>> 
>>> 
>>>> On Jan 31, 2017, at 1:02 PM, Marton, Elek <h...@anzix.net 
>>>> <mailto:h...@anzix.net>> wrote:
>>>> 
>>>> 
>>>> I tested this one with hbase 1.2.4:
>>>> 
>>>> https://github.com/hortonworks-spark/shc 
>>>> <https://github.com/hortonworks-spark/shc>
>>>> 
>>>> Marton
>>>> 
>>>> On 01/31/2017 09:17 PM, Benjamin Kim wrote:
>>>>> Does anyone know how to backport the HBase Spark module to HBase 1.2.0? I 
>>>>> tried to build it from source, but I cannot get it to work.
>>>>> 
>>>>> Thanks,
>>>>> Ben
>>>>> -
>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
>>>>> <mailto:user-unsubscr...@spark.apache.org>
>>>>> 
>>>> 
>>>> -
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
>>>> <mailto:user-unsubscr...@spark.apache.org>
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 



Re: HBase Spark

2017-02-03 Thread Benjamin Kim
Asher,

It’s still the same. Do you have any other ideas?

Cheers,
Ben


> On Feb 3, 2017, at 8:16 AM, Asher Krim <ak...@hubspot.com> wrote:
> 
> Did you check the actual maven dep tree? Something might be pulling in a 
> different version. Also, if you're seeing this locally, you might want to 
> check which version of the scala sdk your IDE is using
> 
> Asher Krim
> Senior Software Engineer
> 
> 
> On Thu, Feb 2, 2017 at 5:43 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Hi Asher,
> 
> I modified the pom to be the same Spark (1.6.0), HBase (1.2.0), and Java 
> (1.8) version as our installation. The Scala (2.10.5) version is already the 
> same as ours. But I’m still getting the same error. Can you think of anything 
> else?
> 
> Cheers,
> Ben
> 
> 
>> On Feb 2, 2017, at 11:06 AM, Asher Krim <ak...@hubspot.com 
>> <mailto:ak...@hubspot.com>> wrote:
>> 
>> Ben,
>> 
>> That looks like a scala version mismatch. Have you checked your dep tree?
>> 
>> Asher Krim
>> Senior Software Engineer
>> 
>> 
>> On Thu, Feb 2, 2017 at 1:28 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Elek,
>> 
>> Can you give me some sample code? I can’t get mine to work.
>> 
>> import org.apache.spark.sql.{SQLContext, _}
>> import org.apache.spark.sql.execution.datasources.hbase._
>> import org.apache.spark.{SparkConf, SparkContext}
>> 
>> def cat = s"""{
>> |"table":{"namespace":"ben", "name":"dmp_test", 
>> "tableCoder":"PrimitiveType"},
>> |"rowkey":"key",
>> |"columns":{
>> |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
>> |"col1":{"cf":"d", "col":"google_gid", "type":"string"}
>> |}
>> |}""".stripMargin
>> 
>> import sqlContext.implicits._
>> 
>> def withCatalog(cat: String): DataFrame = {
>> sqlContext
>> .read
>> .options(Map(HBaseTableCatalog.tableCatalog->cat))
>> .format("org.apache.spark.sql.execution.datasources.hbase")
>> .load()
>> }
>> 
>> val df = withCatalog(cat)
>> df.show
>> 
>> It gives me this error.
>> 
>> java.lang.NoSuchMethodError: 
>> scala.runtime.ObjectRef.create(Ljava/lang/Object;)Lscala/runtime/ObjectRef;
>>  at 
>> org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$.apply(HBaseTableCatalog.scala:232)
>>  at 
>> org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.(HBaseRelation.scala:77)
>>  at 
>> org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:51)
>>  at 
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
>>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>> 
>> If you can please help, I would be grateful.
>> 
>> Cheers,
>> Ben
>> 
>> 
>>> On Jan 31, 2017, at 1:02 PM, Marton, Elek <h...@anzix.net 
>>> <mailto:h...@anzix.net>> wrote:
>>> 
>>> 
>>> I tested this one with hbase 1.2.4:
>>> 
>>> https://github.com/hortonworks-spark/shc 
>>> <https://github.com/hortonworks-spark/shc>
>>> 
>>> Marton
>>> 
>>> On 01/31/2017 09:17 PM, Benjamin Kim wrote:
>>>> Does anyone know how to backport the HBase Spark module to HBase 1.2.0? I 
>>>> tried to build it from source, but I cannot get it to work.
>>>> 
>>>> Thanks,
>>>> Ben
>>>> -
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
>>>> <mailto:user-unsubscr...@spark.apache.org>
>>>> 
>>> 
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
>>> <mailto:user-unsubscr...@spark.apache.org>
>>> 
>> 
>> 
> 
> 



Re: HBase Spark

2017-02-03 Thread Benjamin Kim
I'll clean up any .m2 or .ivy directories. And try again.

I ran this on our lab cluster for testing.

Cheers,
Ben


On Fri, Feb 3, 2017 at 8:16 AM Asher Krim <ak...@hubspot.com> wrote:

> Did you check the actual maven dep tree? Something might be pulling in a
> different version. Also, if you're seeing this locally, you might want to
> check which version of the scala sdk your IDE is using
>
> Asher Krim
> Senior Software Engineer
>
> On Thu, Feb 2, 2017 at 5:43 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
> Hi Asher,
>
> I modified the pom to be the same Spark (1.6.0), HBase (1.2.0), and Java
> (1.8) version as our installation. The Scala (2.10.5) version is already
> the same as ours. But I’m still getting the same error. Can you think of
> anything else?
>
> Cheers,
> Ben
>
>
> On Feb 2, 2017, at 11:06 AM, Asher Krim <ak...@hubspot.com> wrote:
>
> Ben,
>
> That looks like a scala version mismatch. Have you checked your dep tree?
>
> Asher Krim
> Senior Software Engineer
>
> On Thu, Feb 2, 2017 at 1:28 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
> Elek,
>
> Can you give me some sample code? I can’t get mine to work.
>
> import org.apache.spark.sql.{SQLContext, _}
> import org.apache.spark.sql.execution.datasources.hbase._
> import org.apache.spark.{SparkConf, SparkContext}
>
> def cat = s"""{
> |"table":{"namespace":"ben", "name":"dmp_test",
> "tableCoder":"PrimitiveType"},
> |"rowkey":"key",
> |"columns":{
> |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
> |"col1":{"cf":"d", "col":"google_gid", "type":"string"}
> |}
> |}""".stripMargin
>
> import sqlContext.implicits._
>
> def withCatalog(cat: String): DataFrame = {
> sqlContext
> .read
> .options(Map(HBaseTableCatalog.tableCatalog->cat))
> .format("org.apache.spark.sql.execution.datasources.hbase")
> .load()
> }
>
> val df = withCatalog(cat)
> df.show
>
>
> It gives me this error.
>
> java.lang.NoSuchMethodError:
> scala.runtime.ObjectRef.create(Ljava/lang/Object;)Lscala/runtime/ObjectRef;
> at
> org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$.apply(HBaseTableCatalog.scala:232)
> at
> org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.(HBaseRelation.scala:77)
> at
> org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:51)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>
>
> If you can please help, I would be grateful.
>
> Cheers,
> Ben
>
>
> On Jan 31, 2017, at 1:02 PM, Marton, Elek <h...@anzix.net> wrote:
>
>
> I tested this one with hbase 1.2.4:
>
> https://github.com/hortonworks-spark/shc
>
> Marton
>
> On 01/31/2017 09:17 PM, Benjamin Kim wrote:
>
> Does anyone know how to backport the HBase Spark module to HBase 1.2.0? I
> tried to build it from source, but I cannot get it to work.
>
> Thanks,
> Ben
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
>
>
>


Re: HBase Spark

2017-02-02 Thread Benjamin Kim
Hi Asher,

I modified the pom to be the same Spark (1.6.0), HBase (1.2.0), and Java (1.8) 
version as our installation. The Scala (2.10.5) version is already the same as 
ours. But I’m still getting the same error. Can you think of anything else?

Cheers,
Ben


> On Feb 2, 2017, at 11:06 AM, Asher Krim <ak...@hubspot.com> wrote:
> 
> Ben,
> 
> That looks like a scala version mismatch. Have you checked your dep tree?
> 
> Asher Krim
> Senior Software Engineer
> 
> 
> On Thu, Feb 2, 2017 at 1:28 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Elek,
> 
> Can you give me some sample code? I can’t get mine to work.
> 
> import org.apache.spark.sql.{SQLContext, _}
> import org.apache.spark.sql.execution.datasources.hbase._
> import org.apache.spark.{SparkConf, SparkContext}
> 
> def cat = s"""{
> |"table":{"namespace":"ben", "name":"dmp_test", 
> "tableCoder":"PrimitiveType"},
> |"rowkey":"key",
> |"columns":{
> |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
> |"col1":{"cf":"d", "col":"google_gid", "type":"string"}
> |}
> |}""".stripMargin
> 
> import sqlContext.implicits._
> 
> def withCatalog(cat: String): DataFrame = {
> sqlContext
> .read
> .options(Map(HBaseTableCatalog.tableCatalog->cat))
> .format("org.apache.spark.sql.execution.datasources.hbase")
> .load()
> }
> 
> val df = withCatalog(cat)
> df.show
> 
> It gives me this error.
> 
> java.lang.NoSuchMethodError: 
> scala.runtime.ObjectRef.create(Ljava/lang/Object;)Lscala/runtime/ObjectRef;
>   at 
> org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$.apply(HBaseTableCatalog.scala:232)
>   at 
> org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.(HBaseRelation.scala:77)
>   at 
> org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
> 
> If you can please help, I would be grateful.
> 
> Cheers,
> Ben
> 
> 
>> On Jan 31, 2017, at 1:02 PM, Marton, Elek <h...@anzix.net 
>> <mailto:h...@anzix.net>> wrote:
>> 
>> 
>> I tested this one with hbase 1.2.4:
>> 
>> https://github.com/hortonworks-spark/shc 
>> <https://github.com/hortonworks-spark/shc>
>> 
>> Marton
>> 
>> On 01/31/2017 09:17 PM, Benjamin Kim wrote:
>>> Does anyone know how to backport the HBase Spark module to HBase 1.2.0? I 
>>> tried to build it from source, but I cannot get it to work.
>>> 
>>> Thanks,
>>> Ben
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
>>> <mailto:user-unsubscr...@spark.apache.org>
>>> 
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
>> <mailto:user-unsubscr...@spark.apache.org>
>> 
> 
> 



Re: HBase Spark

2017-02-02 Thread Benjamin Kim
Elek,

Can you give me some sample code? I can’t get mine to work.

import org.apache.spark.sql.{SQLContext, _}
import org.apache.spark.sql.execution.datasources.hbase._
import org.apache.spark.{SparkConf, SparkContext}

def cat = s"""{
|"table":{"namespace":"ben", "name":"dmp_test", 
"tableCoder":"PrimitiveType"},
|"rowkey":"key",
|"columns":{
|"col0":{"cf":"rowkey", "col":"key", "type":"string"},
|"col1":{"cf":"d", "col":"google_gid", "type":"string"}
|}
|}""".stripMargin

import sqlContext.implicits._

def withCatalog(cat: String): DataFrame = {
sqlContext
.read
.options(Map(HBaseTableCatalog.tableCatalog->cat))
.format("org.apache.spark.sql.execution.datasources.hbase")
.load()
}

val df = withCatalog(cat)
df.show

It gives me this error.

java.lang.NoSuchMethodError: 
scala.runtime.ObjectRef.create(Ljava/lang/Object;)Lscala/runtime/ObjectRef;
at 
org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$.apply(HBaseTableCatalog.scala:232)
at 
org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.<init>(HBaseRelation.scala:77)
at 
org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:51)
at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)

If you can please help, I would be grateful.

Cheers,
Ben
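
As a side note in case it helps: scala.runtime.ObjectRef.create only exists from 
Scala 2.11 onward, so this NoSuchMethodError usually points to an shc jar built for 
2.11 running on a 2.10 cluster (an assumption here, not a confirmed diagnosis). A 
quick check from the same spark-shell shows what the runtime really is; the version 
strings in the comments are only illustrative:

println(scala.util.Properties.versionString)  // e.g. "version 2.10.5" -> Scala runtime of this shell
println(sc.version)                           // e.g. "1.6.0" -> Spark version (sc: existing SparkContext)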


> On Jan 31, 2017, at 1:02 PM, Marton, Elek <h...@anzix.net> wrote:
> 
> 
> I tested this one with hbase 1.2.4:
> 
> https://github.com/hortonworks-spark/shc
> 
> Marton
> 
> On 01/31/2017 09:17 PM, Benjamin Kim wrote:
>> Does anyone know how to backport the HBase Spark module to HBase 1.2.0? I 
>> tried to build it from source, but I cannot get it to work.
>> 
>> Thanks,
>> Ben
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 



Re: HBase Spark

2017-01-31 Thread Benjamin Kim
Elek,

If I cannot use the HBase Spark module, then I’ll give it a try.

Thanks,
Ben


> On Jan 31, 2017, at 1:02 PM, Marton, Elek <h...@anzix.net> wrote:
> 
> 
> I tested this one with hbase 1.2.4:
> 
> https://github.com/hortonworks-spark/shc
> 
> Marton
> 
> On 01/31/2017 09:17 PM, Benjamin Kim wrote:
>> Does anyone know how to backport the HBase Spark module to HBase 1.2.0? I 
>> tried to build it from source, but I cannot get it to work.
>> 
>> Thanks,
>> Ben
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



HBase Spark

2017-01-31 Thread Benjamin Kim
Does anyone know how to backport the HBase Spark module to HBase 1.2.0? I tried 
to build it from source, but I cannot get it to work.

Thanks,
Ben
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: PostgreSQL JDBC Connections

2017-01-05 Thread Benjamin Kim
We are using the JDBC interpreter. The business analysts only know SQL and run 
ad-hoc queries for their report exports to CSV.

Cheers,
Ben


> On Jan 5, 2017, at 2:21 PM, t p <tauis2...@gmail.com> wrote:
> 
> Are you using JDBC or the PSQL interpreter? I had encountered something 
> similar while using the PSQL interpreter and I had to restart Zeppelin. 
> 
> My experience using PSQL (Postgresql, HAWK) was not as good as using 
> spark/scala wrappers (JDBC data source) to connect via JDBC and then register 
> temp tables. This approach allowed me to work with dynamic forms in a more 
> meaningful way e.g. use SQL results to create a new drop down to drive the 
> next page etc…
> 
> 
> 
>> On Jan 5, 2017, at 12:57 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>> 
>> We are getting “out of shared memory” errors when multiple users are running 
>> SQL queries against our PostgreSQL DB either simultaneously or throughout 
>> the day. When this happens, Zeppelin 0.6.0 becomes unresponsive for any more 
>> SQL queries. It looks like this is being caused by too many locks being 
>> taken and not released, transactions never closing, and/or connections never 
>> closing. Has anyone encountered Zeppelin 0.6.0 such an issue as this? If so, 
>> is there a solution for it?
>> 
>> Thanks,
>> Ben
> 
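
A rough sketch of the suggestion quoted above (wrap PostgreSQL with the Spark JDBC 
data source and register temp tables, so analysts query a cached table instead of 
opening new PostgreSQL connections per query), using the sqlContext provided by the 
Zeppelin Spark interpreter; the URL, credentials, and table names are placeholders:

val jdbcUrl = "jdbc:postgresql://pg-host:5432/analytics"  // placeholder

val props = new java.util.Properties()
props.setProperty("user", "report_user")              // placeholder
props.setProperty("password", "---")                  // placeholder
props.setProperty("driver", "org.postgresql.Driver")

// One read per refresh instead of one PostgreSQL connection per ad-hoc query.
val orders = sqlContext.read.jdbc(jdbcUrl, "public.orders", props)
orders.cache()
orders.registerTempTable("orders")

// Analysts then run their ad-hoc SQL against the cached temp table.
sqlContext.sql("SELECT COUNT(*) FROM orders").show()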



Re: Merging Parquet Files

2016-12-22 Thread Benjamin Kim
Thanks, Hyukjin.

I’ll try using the Parquet tools for 1.9

> On Dec 23, 2016, at 12:43 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> 
> Hi Benjamin,
> 
> As you might already know, I believe the Hadoop command automatically does
> not merge the column-based format such as ORC or Parquet but just simply
> concatenates them.
> 
> I haven't tried this by myself but I remember I saw a JIRA in Parquet -
> https://issues.apache.org/jira/browse/PARQUET-460
> 
> It seems parquet-tools allows merge small Parquet files into one.
> 
> Also, I believe there are command-line tools in Kite -
> https://github.com/kite-sdk/kite
> 
> This might be useful.
> 
> Thanks!
> 
> 2016-12-23 7:01 GMT+09:00 Benjamin Kim <bbuil...@gmail.com>:
> Has anyone tried to merge *.gz.parquet files before? I'm trying to merge
> them into 1 file after they are output from Spark. Doing a coalesce(1) on
> the Spark cluster will not work. It just does not have the resources to do
> it. I'm trying to do it using the commandline and not use Spark. I will use
> this command in shell script. I tried "hdfs dfs -getmerge", but the file
> becomes unreadable by Spark with gzip footer error.
> 
> Thanks,
> Ben
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: Merging Parquet Files

2016-12-22 Thread Benjamin Kim
Thanks, Hyukjin.

I’ll try using the Parquet tools for 1.9 based on the jira. If that doesn’t 
work, I’ll try Kite.

Cheers,
Ben


> On Dec 23, 2016, at 12:43 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> 
> Hi Benjamin,
> 
> 
> As you might already know, I believe the Hadoop command automatically does 
> not merge the column-based format such as ORC or Parquet but just simply 
> concatenates them.
> 
> I haven't tried this by myself but I remember I saw a JIRA in Parquet - 
> https://issues.apache.org/jira/browse/PARQUET-460 
> <https://issues.apache.org/jira/browse/PARQUET-460>
> 
> It seems parquet-tools allows merge small Parquet files into one. 
> 
> 
> Also, I believe there are command-line tools in Kite - 
> https://github.com/kite-sdk/kite <https://github.com/kite-sdk/kite>
> 
> This might be useful.
> 
> 
> Thanks!
> 
> 2016-12-23 7:01 GMT+09:00 Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>>:
> Has anyone tried to merge *.gz.parquet files before? I'm trying to merge them 
> into 1 file after they are output from Spark. Doing a coalesce(1) on the 
> Spark cluster will not work. It just does not have the resources to do it. 
> I'm trying to do it using the commandline and not use Spark. I will use this 
> command in shell script. I tried "hdfs dfs -getmerge", but the file becomes 
> unreadable by Spark with gzip footer error.
> 
> Thanks,
> Ben
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> 
> 



Merging Parquet Files

2016-12-22 Thread Benjamin Kim
Has anyone tried to merge *.gz.parquet files before? I'm trying to merge them 
into 1 file after they are output from Spark. Doing a coalesce(1) on the Spark 
cluster will not work. It just does not have the resources to do it. I'm trying 
to do it using the commandline and not use Spark. I will use this command in 
shell script. I tried "hdfs dfs -getmerge", but the file becomes unreadable by 
Spark with gzip footer error.

Thanks,
Ben
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Deep learning libraries for scala

2016-11-01 Thread Benjamin Kim
To add, I see that Databricks has been busy integrating deep learning more into 
their product and put out a new article about this.

https://databricks.com/blog/2016/10/27/gpu-acceleration-in-databricks.html 
<https://databricks.com/blog/2016/10/27/gpu-acceleration-in-databricks.html>

An interesting tidbit is at the bottom of the article mentioning TensorFrames.

https://github.com/databricks/tensorframes 
<https://github.com/databricks/tensorframes>

Seems like an interesting direction…

Cheers,
Ben


> On Oct 19, 2016, at 9:05 AM, janardhan shetty <janardhan...@gmail.com> wrote:
> 
> Agreed. But as it states deeper integration with (scala) is yet to be 
> developed. 
> Any thoughts on how to use tensorflow with scala ? Need to write wrappers I 
> think.
> 
> 
> On Oct 19, 2016 7:56 AM, "Benjamin Kim" <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> On that note, here is an article that Databricks made regarding using 
> Tensorflow in conjunction with Spark.
> 
> https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
>  
> <https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html>
> 
> Cheers,
> Ben
> 
> 
>> On Oct 19, 2016, at 3:09 AM, Gourav Sengupta <gourav.sengu...@gmail.com 
>> <mailto:gourav.sengu...@gmail.com>> wrote:
>> 
>> while using Deep Learning you might want to stay as close to tensorflow as 
>> possible. There is very less translation loss, you get to access stable, 
>> scalable and tested libraries from the best brains in the industry and as 
>> far as Scala goes, it helps a lot to think about using the language as a 
>> tool to access algorithms in this instance unless you want to start 
>> developing algorithms from grounds up ( and in which case you might not 
>> require any libraries at all).
>> 
>> On Sat, Oct 1, 2016 at 3:30 AM, janardhan shetty <janardhan...@gmail.com 
>> <mailto:janardhan...@gmail.com>> wrote:
>> Hi,
>> 
>> Are there any good libraries which can be used for scala deep learning 
>> models ?
>> How can we integrate tensorflow with scala ML ?
>> 
> 



Spark Streaming and Kinesis

2016-10-27 Thread Benjamin Kim
Has anyone worked with AWS Kinesis and retrieved data from it using Spark 
Streaming? I am having issues where it’s returning no data. I can connect to 
the Kinesis stream and describe using Spark. Is there something I’m missing? 
Are there specific IAM security settings needed? I just simply followed the 
Word Count ASL example. When it didn’t work, I even tried to run the code 
independently in Spark shell in yarn-client mode by hardcoding the arguments. 
Still, there was no data even with the setting InitialPositionInStream.LATEST 
changed to InitialPositionInStream.TRIM_HORIZON.

If anyone can help, I would truly appreciate it.

Thanks,
Ben
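
For reference, the shape of the receiver setup being followed from the Word Count 
ASL example is roughly the below; the app name, stream name, endpoint URL, and 
region are placeholders. One assumption worth double-checking rather than a 
confirmed cause: the IAM credentials also need access to the DynamoDB table that 
the Kinesis Client Library creates under the application name for checkpointing.

import java.nio.charset.StandardCharsets
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

val batchInterval = Seconds(10)
val ssc = new StreamingContext(sc, batchInterval)  // sc: existing SparkContext

// All names below are placeholders; the app name doubles as the KCL checkpoint table in DynamoDB.
val stream = KinesisUtils.createStream(
  ssc, "my-kinesis-app", "my-stream",
  "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
  InitialPositionInStream.TRIM_HORIZON, batchInterval, StorageLevel.MEMORY_AND_DISK_2)

stream.map(bytes => new String(bytes, StandardCharsets.UTF_8)).print()
ssc.start()
ssc.awaitTermination()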
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Deep learning libraries for scala

2016-10-19 Thread Benjamin Kim
On that note, here is an article that Databricks made regarding using 
Tensorflow in conjunction with Spark.

https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html

Cheers,
Ben


> On Oct 19, 2016, at 3:09 AM, Gourav Sengupta  
> wrote:
> 
> while using Deep Learning you might want to stay as close to tensorflow as 
> possible. There is very less translation loss, you get to access stable, 
> scalable and tested libraries from the best brains in the industry and as far 
> as Scala goes, it helps a lot to think about using the language as a tool to 
> access algorithms in this instance unless you want to start developing 
> algorithms from grounds up ( and in which case you might not require any 
> libraries at all).
> 
> On Sat, Oct 1, 2016 at 3:30 AM, janardhan shetty  > wrote:
> Hi,
> 
> Are there any good libraries which can be used for scala deep learning models 
> ?
> How can we integrate tensorflow with scala ML ?
> 



JDBC Connections

2016-10-18 Thread Benjamin Kim
We are using Zeppelin 0.6.0 as a self-service for our clients to query our 
PostgreSQL databases. We are noticing that the connections are not closing 
after each one of them are done. What is the normal operating procedure to have 
these connections close when idle? Our scope for the JDBC interpreter is 
“shared”, which I thought would make 1 connection for all notebooks. It would 
seem that I am wrong. Anyone have any ideas on what would help?

Thanks,
Ben



Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Benjamin Kim
This will give me an opportunity to start using Structured Streaming. Then, I 
can try adding more functionality. If all goes well, then we could transition 
off of HBase to a more in-memory data solution that can “spill-over” data for 
us.
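
A compressed sketch of the pattern described in the quoted replies below: load and 
cache tables in an ordinary Spark job, then start the Thriftserver from inside that 
job so the cached tables are what JDBC/ODBC clients (e.g. Tableau) see. This 
assumes Spark 1.6 with a HiveContext; the table name and source path are 
placeholders.

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(sc)  // sc: existing SparkContext

// Periodically (re)load whatever modeling fits the queries, then cache it in memory.
val snapshot = hiveContext.read.parquet("hdfs:///dmp/profiles_snapshot")  // placeholder source
snapshot.registerTempTable("profiles")
hiveContext.cacheTable("profiles")

// Expose the cached temp table over JDBC/ODBC from this same job.
HiveThriftServer2.startWithContext(hiveContext)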

> On Oct 17, 2016, at 11:53 AM, vincent gromakowski 
> <vincent.gromakow...@gmail.com> wrote:
> 
> Instead of (or additionally to) saving results somewhere, you just start a 
> thriftserver that expose the Spark tables of the SQLContext (or SparkSession 
> now). That means you can implement any logic (and maybe use structured 
> streaming) to expose your data. Today using the thriftserver means reading 
> data from the persistent store every query, so if the data modeling doesn't 
> fit the query it can be quite long.  What you generally do in a common spark 
> job is to load the data and cache spark table in a in-memory columnar table 
> which is quite efficient for any kind of query, the counterpart is that the 
> cache isn't updated you have to implement a reload mechanism, and this 
> solution isn't available using the thriftserver.
> What I propose is to mix the two world: periodically/delta load data in spark 
> table cache and expose it through the thriftserver. But you have to implement 
> the loading logic, it can be very simple to very complex depending on your 
> needs.
> 
> 
> 2016-10-17 19:48 GMT+02:00 Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>>:
> Is this technique similar to what Kinesis is offering or what Structured 
> Streaming is going to have eventually?
> 
> Just curious.
> 
> Cheers,
> Ben
> 
>  
>> On Oct 17, 2016, at 10:14 AM, vincent gromakowski 
>> <vincent.gromakow...@gmail.com <mailto:vincent.gromakow...@gmail.com>> wrote:
>> 
>> I would suggest to code your own Spark thriftserver which seems to be very 
>> easy.
>> http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server
>>  
>> <http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server>
>> 
>> I am starting to test it. The big advantage is that you can implement any 
>> logic because it's a spark job and then start a thrift server on temporary 
>> table. For example you can query a micro batch rdd from a kafka stream, or 
>> pre load some tables and implement a rolling cache to periodically update 
>> the spark in memory tables with persistent store...
>> It's not part of the public API and I don't know yet what are the issues 
>> doing this but I think Spark community should look at this path: making the 
>> thriftserver be instantiable in any spark job.
>> 
>> 2016-10-17 18:17 GMT+02:00 Michael Segel <msegel_had...@hotmail.com 
>> <mailto:msegel_had...@hotmail.com>>:
>> Guys, 
>> Sorry for jumping in late to the game… 
>> 
>> If memory serves (which may not be a good thing…) :
>> 
>> You can use HiveServer2 as a connection point to HBase.  
>> While this doesn’t perform well, its probably the cleanest solution. 
>> I’m not keen on Phoenix… wouldn’t recommend it…. 
>> 
>> 
>> The issue is that you’re trying to make HBase, a key/value object store, a 
>> Relational Engine… its not. 
>> 
>> There are some considerations which make HBase not ideal for all use cases 
>> and you may find better performance with Parquet files. 
>> 
>> One thing missing is the use of secondary indexing and query optimizations 
>> that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your 
>> performance will vary. 
>> 
>> With respect to Tableau… their entire interface in to the big data world 
>> revolves around the JDBC/ODBC interface. So if you don’t have that piece as 
>> part of your solution, you’re DOA w respect to Tableau. 
>> 
>> Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
>> Apache project) 
>> 
>> 
>>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> 
>>> Thanks for all the suggestions. It would seem you guys are right about the 
>>> Tableau side of things. The reports don’t need to be real-time, and they 
>>> won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be 
>>> batched to Parquet or Kudu/Impala or even PostgreSQL.
>>> 
>>> I originally thought that we needed two-way data retrieval from the DMP 
>>> HBase for ID generation, but after further investigation into the use-case 
>>> and architecture, the ID generation needs to happen 

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Benjamin Kim
Is this technique similar to what Kinesis is offering or what Structured 
Streaming is going to have eventually?

Just curious.

Cheers,
Ben

 
> On Oct 17, 2016, at 10:14 AM, vincent gromakowski 
> <vincent.gromakow...@gmail.com> wrote:
> 
> I would suggest to code your own Spark thriftserver which seems to be very 
> easy.
> http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server
>  
> <http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server>
> 
> I am starting to test it. The big advantage is that you can implement any 
> logic because it's a spark job and then start a thrift server on temporary 
> table. For example you can query a micro batch rdd from a kafka stream, or 
> pre load some tables and implement a rolling cache to periodically update the 
> spark in memory tables with persistent store...
> It's not part of the public API and I don't know yet what are the issues 
> doing this but I think Spark community should look at this path: making the 
> thriftserver be instantiable in any spark job.
> 
> 2016-10-17 18:17 GMT+02:00 Michael Segel <msegel_had...@hotmail.com 
> <mailto:msegel_had...@hotmail.com>>:
> Guys, 
> Sorry for jumping in late to the game… 
> 
> If memory serves (which may not be a good thing…) :
> 
> You can use HiveServer2 as a connection point to HBase.  
> While this doesn’t perform well, its probably the cleanest solution. 
> I’m not keen on Phoenix… wouldn’t recommend it…. 
> 
> 
> The issue is that you’re trying to make HBase, a key/value object store, a 
> Relational Engine… its not. 
> 
> There are some considerations which make HBase not ideal for all use cases 
> and you may find better performance with Parquet files. 
> 
> One thing missing is the use of secondary indexing and query optimizations 
> that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your 
> performance will vary. 
> 
> With respect to Tableau… their entire interface in to the big data world 
> revolves around the JDBC/ODBC interface. So if you don’t have that piece as 
> part of your solution, you’re DOA w respect to Tableau. 
> 
> Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
> Apache project) 
> 
> 
>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> 
>> Thanks for all the suggestions. It would seem you guys are right about the 
>> Tableau side of things. The reports don’t need to be real-time, and they 
>> won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be 
>> batched to Parquet or Kudu/Impala or even PostgreSQL.
>> 
>> I originally thought that we needed two-way data retrieval from the DMP 
>> HBase for ID generation, but after further investigation into the use-case 
>> and architecture, the ID generation needs to happen local to the Ad Servers 
>> where we generate a unique ID and store it in a ID linking table. Even 
>> better, many of the 3rd party services supply this ID. So, data only needs 
>> to flow in one direction. We will use Kafka as the bus for this. No JDBC 
>> required. This is also goes for the REST Endpoints. 3rd party services will 
>> hit ours to update our data with no need to read from our data. And, when we 
>> want to update their data, we will hit theirs to update their data using a 
>> triggered job.
>> 
>> This all boils down to just integrating with Kafka.
>> 
>> Once again, thanks for all the help.
>> 
>> Cheers,
>> Ben
>> 
>> 
>>> On Oct 9, 2016, at 3:16 AM, Jörn Franke <jornfra...@gmail.com 
>>> <mailto:jornfra...@gmail.com>> wrote:
>>> 
>>> please keep also in mind that Tableau Server has the capabilities to store 
>>> data in-memory and refresh only when needed the in-memory data. This means 
>>> you can import it from any source and let your users work only on the 
>>> in-memory data in Tableau Server.
>>> 
>>> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jornfra...@gmail.com 
>>> <mailto:jornfra...@gmail.com>> wrote:
>>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich provided 
>>> already a good alternative. However, you should check if it contains a 
>>> recent version of Hbase and Phoenix. That being said, I just wonder what is 
>>> the dataflow, data model and the analysis you plan to do. Maybe there are 
>>> completely different solutions possible. Especially these single inserts, 
>>> upserts etc. should be avoided
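
A minimal sketch of the "periodically load and cache a Spark table, then expose it through the Thrift server" idea suggested in this thread (Spark 1.6; startWithContext needs the spark-hive-thriftserver artifact on the classpath, and the source path and table name are illustrative):

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
import org.apache.spark.{SparkConf, SparkContext}

object CachedTableThriftServer {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CachedTableThriftServer"))
    val hiveContext = new HiveContext(sc)

    // (Re)load whatever should be queryable and pin it as an in-memory table.
    hiveContext.read.parquet("/data/events").registerTempTable("events")
    hiveContext.cacheTable("events")

    // Expose this context's tables over JDBC/ODBC (port comes from hive.server2.thrift.port).
    HiveThriftServer2.startWithContext(hiveContext)

    // Keep the driver alive; a periodic reload/refresh loop would go here.
    Thread.currentThread().join()
  }
}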

Re: Inserting New Primary Keys

2016-10-10 Thread Benjamin Kim
Jean,

I see your point. For the incremental data, which is very small, I should make 
sure that the PARTITION BY in the OVER(PARTITION BY ...) is left out so that 
all the data will be in one partition when assigned a row number. The query 
below should avoid any problems.

“SELECT ROW_NUMBER() OVER() + b.id_max AS id, a.* FROM source a CROSS JOIN 
(SELECT COALESCE(MAX(id),0) AS id_max FROM tmp_destination) b”.

But initially, I’ll use the monotonicallyIncreasingId function when I first 
load the data.

Thanks,
Ben


> On Oct 10, 2016, at 8:36 AM, Jean Georges Perrin <j...@jgp.net> wrote:
> 
> Is there only one process adding rows? because this seems a little risky if 
> you have multiple threads doing that… 
> 
>> On Oct 8, 2016, at 1:43 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> 
>> Mich,
>> 
>> After much searching, I found and am trying to use “SELECT ROW_NUMBER() 
>> OVER() + b.id_max AS id, a.* FROM source a CROSS JOIN (SELECT 
>> COALESCE(MAX(id),0) AS id_max FROM tmp_destination) b”. I think this should 
>> do it.
>> 
>> Thanks,
>> Ben
>> 
>> 
>>> On Oct 8, 2016, at 9:48 AM, Mich Talebzadeh <mich.talebza...@gmail.com 
>>> <mailto:mich.talebza...@gmail.com>> wrote:
>>> 
>>> can you get the max value from the current  table and start from MAX(ID) + 
>>> 1 assuming it is a numeric value (it should be)?
>>> 
>>> HTH
>>> 
>>> HTH
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>> 
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>>> loss, damage or destruction of data or any other property which may arise 
>>> from relying on this email's technical content is explicitly disclaimed. 
>>> The author will in no case be liable for any monetary damages arising from 
>>> such loss, damage or destruction.
>>>  
>>> 
>>> On 8 October 2016 at 17:42, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> I have a table with data already in it that has primary keys generated by 
>>> the function monotonicallyIncreasingId. Now, I want to insert more data 
>>> into it with primary keys that will auto-increment from where the existing 
>>> data left off. How would I do this? There is no argument I can pass into 
>>> the function monotonicallyIncreasingId to seed it.
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
>>> <mailto:user-unsubscr...@spark.apache.org>
>>> 
>>> 
>> 
> 

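Putting the two parts of this thread together, a minimal sketch (Spark 1.6, using a HiveContext-backed sqlContext for the window function; sourceDF, destinationDF, and the table names are illustrative):

import org.apache.spark.sql.functions.monotonicallyIncreasingId

// Initial load: ids are unique but not contiguous.
val initial = sourceDF.withColumn("id", monotonicallyIncreasingId)

// Incremental loads: continue numbering from the current maximum id, as in the query above.
sourceDF.registerTempTable("source")
destinationDF.registerTempTable("tmp_destination")

val withIds = sqlContext.sql(
  """SELECT ROW_NUMBER() OVER () + b.id_max AS id, a.*
    |FROM source a
    |CROSS JOIN (SELECT COALESCE(MAX(id), 0) AS id_max FROM tmp_destination) b
  """.stripMargin)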


Inserting New Primary Keys

2016-10-08 Thread Benjamin Kim
I have a table with data already in it that has primary keys generated by the 
function monotonicallyIncreasingId. Now, I want to insert more data into it 
with primary keys that will auto-increment from where the existing data left 
off. How would I do this? There is no argument I can pass into the function 
monotonicallyIncreasingId to seed it.

Thanks,
Ben


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Kudu Command Line Client

2016-10-07 Thread Benjamin Kim
Todd,

That works.

Thanks,
Ben

> On Oct 7, 2016, at 5:03 PM, Todd Lipcon <t...@cloudera.com> wrote:
> 
> On Fri, Oct 7, 2016 at 5:01 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Todd,
> 
> I’m trying to use:
> 
> kudu table list
> 
> I get:
> 
> Invalid argument: must provide master_addresses
> 
> I do the help:
> 
> Usage: 
> /opt/cloudera/parcels/KUDU-1.0.0-1.kudu1.0.0.p0.6/bin/../lib/kudu/bin/kudu 
> <command> [<args>]
> 
> <command> can be one of the following:
>  cluster   Operate on a Kudu cluster
>   fs   Operate on a local Kudu filesystem
>local_replica   Operate on local Kudu replicas via the local filesystem
>   master   Operate on a Kudu Master
>  pbc   Operate on PBC (protobuf container) files
>   remote_replica   Operate on replicas on a Kudu Tablet Server
>table   Operate on Kudu tables
>   tablet   Operate on remote Kudu tablets
>  tserver   Operate on a Kudu Tablet Server
>  wal   Operate on WAL (write-ahead log) files
> 
> I don’t see how to add the master_addresses.
> 
> If you run "kudu table list --help" it should give you:
> 
> [todd@ve0120 ~]$ kudu table list --help
> Usage: 
> 
> /opt/cloudera/parcels/KUDU-1.1.0-1.kudu1.1.0.p0.685/bin/../lib/kudu/bin/kudu 
> table list <master_addresses> [-list_tablets]
> 
> List all tables
> 
> master_addresses (Comma-separated list of Kudu Master addresses where each
>   address is of form 'hostname:port') type: string default: ""
> 
> -list_tablets (Include tablet and replica UUIDs in the output) type: bool
>   default: false
> 
> 
> so the command should be "kudu table list localhost" if you're running on the 
> same node as a master.
> 
> We should probably print the short usage info on any error so that this is 
> more obvious.
> 
> -Todd
>  
> 
> Thanks,
> Ben
> 
> 
>> On Oct 7, 2016, at 4:13 PM, Todd Lipcon <t...@cloudera.com 
>> <mailto:t...@cloudera.com>> wrote:
>> 
>> Hey Ben,
>> 
>> Which command are you using? try adding --help, and it should give you a 
>> usage statement.
>> 
>> -Todd
>> 
>> On Fri, Oct 7, 2016 at 4:12 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Does anyone know how to use the new Kudu command line client? It used to be 
>> kudu-admin, but that is no more. I keep being asked for the 
>> master_addresses. I tried different combinations to no avail. Can someone 
>> direct me to the documentation for it?
>> 
>> Thanks,
>> Ben
>> 
>> 
>> 
>> -- 
>> Todd Lipcon
>> Software Engineer, Cloudera
> 
> 
> 
> 
> -- 
> Todd Lipcon
> Software Engineer, Cloudera



Re: Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

2016-10-01 Thread Benjamin Kim
Mich,

I know up until CDH 5.4 we had to add the HTrace jar to the classpath to make 
it work using the command below. But after upgrading to CDH 5.7, it became 
unnecessary.

echo "/opt/cloudera/parcels/CDH/jars/htrace-core-3.2.0-incubating.jar" >> 
/etc/spark/conf/classpath.txt

Hope this helps.

Cheers,
Ben


> On Oct 1, 2016, at 3:22 PM, Mich Talebzadeh  wrote:
> 
> Trying bulk load using Hfiles in Spark as below example:
> 
> import org.apache.spark._
> import org.apache.spark.rdd.NewHadoopRDD
> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
> import org.apache.hadoop.hbase.client.HBaseAdmin
> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hbase.HColumnDescriptor
> import org.apache.hadoop.hbase.util.Bytes
> import org.apache.hadoop.hbase.client.Put;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.mapred.TableOutputFormat
> import org.apache.hadoop.mapred.JobConf
> import org.apache.hadoop.hbase.io.ImmutableBytesWritable
> import org.apache.hadoop.mapreduce.Job
> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
> import org.apache.hadoop.hbase.KeyValue
> import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
> import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
> 
> So far no issues.
> 
> Then I do
> 
> val conf = HBaseConfiguration.create()
> conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hbase-default.xml, hbase-site.xml
> val tableName = "testTable"
> tableName: String = testTable
> 
> But this one fails:
> 
> scala> val table = new HTable(conf, tableName)
> java.io.IOException: java.lang.reflect.InvocationTargetException
>   at 
> org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:240)
>   at 
> org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:431)
>   at 
> org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:424)
>   at 
> org.apache.hadoop.hbase.client.ConnectionManager.getConnectionInternal(ConnectionManager.java:302)
>   at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:185)
>   at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:151)
>   ... 52 elided
> Caused by: java.lang.reflect.InvocationTargetException: 
> java.lang.NoClassDefFoundError: org/apache/htrace/Trace
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:238)
>   ... 57 more
> Caused by: java.lang.NoClassDefFoundError: org/apache/htrace/Trace
>   at 
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:216)
>   at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:419)
>   at 
> org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
>   at 
> org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:105)
>   at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.retrieveClusterId(ConnectionManager.java:905)
>   at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:648)
>   ... 62 more
> Caused by: java.lang.ClassNotFoundException: org.apache.htrace.Trace
> 
> I have got all the jar files in spark-defaults.conf
> 
> spark.driver.extraClassPath  
> /home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar:/home/hduser/jars/hbase-client-1.2.3.jar:/home/hduser/jars/hbase-server-1.2.3.jar:/home/hduser/jars/hbase-common-1.2.3.jar:/home/hduser/jars/hbase-protocol-1.2.3.jar:/home/hduser/jars/htrace-core-3.0.4.jar:/home/hduser/jars/hive-hbase-handler-2.1.0.jar
> spark.executor.extraClassPath
> /home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar:/home/hduser/jars/hbase-client-1.2.3.jar:/home/hduser/jars/hbase-server-1.2.3.jar:/home/hduser/jars/hbase-common-1.2.3.jar:/home/hduser/jars/hbase-protocol-1.2.3.jar:/home/hduser/jars/htrace-core-3.0.4.jar:/home/hduser/jars/hive-hbase-handler-2.1.0.jar
> 
> 
> and also in Spark shell where I test the code
> 
>  --jars 
> /home/hduser/jars/hbase-client-1.2.3.jar,/home/hduser/jars/hbase-server-1.2.3.jar,/home/hduser/jars/hbase-common-1.2.3.jar,/home/hduser/jars/hbase-protocol-1.2.3.jar,/home/hduser/jars/htrace-core-3.0.4.jar,/home/hduser/jars/hive-hbase-handler-2.1.0.jar'
> 
> So any ideas will be appreciated.
> 
> 

Re: Spark on Kudu

2016-09-20 Thread Benjamin Kim
Thanks!

> On Sep 20, 2016, at 3:02 PM, Jordan Birdsell <jordantbirds...@gmail.com> 
> wrote:
> 
> http://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark 
> <http://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark>
> 
> On Tue, Sep 20, 2016 at 5:00 PM Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> I see that the API has changed a bit so my old code doesn’t work anymore. Can 
> someone direct me to some code samples?
> 
> Thanks,
> Ben
> 
> 
>> On Sep 20, 2016, at 1:44 PM, Todd Lipcon <t...@cloudera.com 
>> <mailto:t...@cloudera.com>> wrote:
>> 
>> On Tue, Sep 20, 2016 at 1:18 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Now that Kudu 1.0.0 is officially out and ready for production use, where do 
>> we find the spark connector jar for this release?
>> 
>> 
>> It's available in the official ASF maven repository:  
>> https://repository.apache.org/#nexus-search;quick~kudu-spark 
>> <https://repository.apache.org/#nexus-search;quick~kudu-spark>
>> 
>> <dependency>
>>   <groupId>org.apache.kudu</groupId>
>>   <artifactId>kudu-spark_2.10</artifactId>
>>   <version>1.0.0</version>
>> </dependency>
>> 
>> 
>> -Todd
>>  
>> 
>> 
>>> On Jun 17, 2016, at 11:08 AM, Dan Burkert <d...@cloudera.com 
>>> <mailto:d...@cloudera.com>> wrote:
>>> 
>>> Hi Ben,
>>> 
>>> To your first question about `CREATE TABLE` syntax with Kudu/Spark SQL, I 
>>> do not think we support that at this point.  I haven't looked deeply into 
>>> it, but we may hit issues specifying Kudu-specific options (partitioning, 
>>> column encoding, etc.).  Probably issues that can be worked through 
>>> eventually, though.  If you are interested in contributing to Kudu, this is 
>>> an area that could obviously use improvement!  Most or all of our Spark 
>>> features have been completely community driven to date.
>>>  
>>> I am assuming that more Spark support along with semantic changes below 
>>> will be incorporated into Kudu 0.9.1.
>>> 
>>> As a rule we do not release new features in patch releases, but the good 
>>> news is that we are releasing regularly, and our next scheduled release is 
>>> for the August timeframe (see JD's roadmap 
>>> <https://lists.apache.org/thread.html/1a3b949e715a74d7f26bd9c102247441a06d16d077324ba39a662e2a@1455234076@%3Cdev.kudu.apache.org%3E>
>>>  email about what we are aiming to include).  Also, Cloudera does publish 
>>> snapshot versions of the Spark connector here 
>>> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/>, so 
>>> the jars are available if you don't mind using snapshots.
>>>  
>>> Anyone know of a better way to make unique primary keys other than using 
>>> UUID to make every row unique if there is no unique column (or combination 
>>> thereof) to use.
>>> 
>>> Not that I know of.  In general it's pretty rare to have a dataset without 
>>> a natural primary key (even if it's just all of the columns), but in those 
>>> cases UUID is a good solution.
>>>  
>>> This is what I am using. I know auto incrementing is coming down the line 
>>> (don’t know when), but is there a way to simulate this in Kudu using Spark 
>>> out of curiosity?
>>> 
>>> To my knowledge there is no plan to have auto increment in Kudu.  
>>> Distributed, consistent, auto incrementing counters is a difficult problem, 
>>> and I don't think there are any known solutions that would be fast enough 
>>> for Kudu (happy to be proven wrong, though!).
>>> 
>>> - Dan
>>>  
>>> 
>>> Thanks,
>>> Ben
>>> 
>>>> On Jun 14, 2016, at 6:08 PM, Dan Burkert <d...@cloudera.com 
>>>> <mailto:d...@cloudera.com>> wrote:
>>>> 
>>>> I'm not sure exactly what the semantics will be, but at least one of them 
>>>> will be upsert.  These modes come from spark, and they were really 
>>>> designed for file-backed storage and not table storage.  We may want to do 
>>>> append = upsert, and overwrite = truncate + insert.  I think that may 
>>>> match the normal spark semantics more closely.
>>>> 
>>>> - Dan
>>>> 
>>>> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim <bbuil...@gmail.com 
>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>> Dan,
>>>> 
>>&g
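
For reference, a minimal read with the kudu-spark_2.10 1.0.0 artifact listed above, following the Kudu/Spark integration page linked earlier in the thread (the master address and table name are illustrative; the jar has to be on the classpath, e.g. via --jars or --packages):

import org.apache.kudu.spark.kudu._

val df = sqlContext.read
  .options(Map("kudu.master" -> "kudu-master-1:7051", "kudu.table" -> "my_table"))
  .kudu   // implicit added by the kudu-spark import

df.registerTempTable("my_table")
sqlContext.sql("SELECT COUNT(*) FROM my_table").show()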

Re: [ANNOUNCE] Apache Kudu 1.0.0 release

2016-09-20 Thread Benjamin Kim
Todd,

Thanks. I’ll look into those.

Cheers,
Ben


> On Sep 20, 2016, at 12:11 AM, Todd Lipcon  wrote:
> 
> The Apache Kudu team is happy to announce the release of Kudu 1.0.0!
> 
> Kudu is an open source storage engine for structured data which supports 
> low-latency random access together with efficient analytical access patterns. 
> It is designed within the context of the Apache Hadoop ecosystem and supports 
> many integrations with other data analytics projects both inside and outside 
> of the Apache Software Foundation.
> 
> This latest version adds several new features, including:
> 
> - Removal of multiversion concurrency control (MVCC) history is now 
> supported. This allows Kudu to reclaim disk space, where previously Kudu 
> would keep a full history of all changes made to a given table since the 
> beginning of time.
> 
> - Most of Kudu’s command line tools have been consolidated under a new 
> top-level "kudu" tool. This reduces the number of large binaries distributed 
> with Kudu and also includes much-improved help output.
> 
> - Administrative tools including "kudu cluster ksck" now support running 
> against multi-master Kudu clusters.
> 
> - The C++ client API now supports writing data in AUTO_FLUSH_BACKGROUND mode. 
> This can provide higher throughput for ingest workloads.
> 
> This release also includes many bug fixes, optimizations, and other 
> improvements, detailed in the release notes available at:
> http://kudu.apache.org/releases/1.0.0/docs/release_notes.html 
> 
> 
> Download the source release here:
> http://kudu.apache.org/releases/1.0.0/ 
> 
> 
> Convenience binary artifacts for the Java client and various Java 
> integrations (eg Spark, Flume) are also now available via the ASF Maven 
> repository.
> 
> Enjoy the new release!
> 
> - The Apache Kudu team



Re: [ANNOUNCE] Apache Kudu 1.0.0 release

2016-09-20 Thread Benjamin Kim
This is awesome!!! Great!!!

Do you know if any improvements were also made to the Spark plugin jar?

Thanks,
Ben

> On Sep 20, 2016, at 12:11 AM, Todd Lipcon  wrote:
> 
> The Apache Kudu team is happy to announce the release of Kudu 1.0.0!
> 
> Kudu is an open source storage engine for structured data which supports 
> low-latency random access together with efficient analytical access patterns. 
> It is designed within the context of the Apache Hadoop ecosystem and supports 
> many integrations with other data analytics projects both inside and outside 
> of the Apache Software Foundation.
> 
> This latest version adds several new features, including:
> 
> - Removal of multiversion concurrency control (MVCC) history is now 
> supported. This allows Kudu to reclaim disk space, where previously Kudu 
> would keep a full history of all changes made to a given table since the 
> beginning of time.
> 
> - Most of Kudu’s command line tools have been consolidated under a new 
> top-level "kudu" tool. This reduces the number of large binaries distributed 
> with Kudu and also includes much-improved help output.
> 
> - Administrative tools including "kudu cluster ksck" now support running 
> against multi-master Kudu clusters.
> 
> - The C++ client API now supports writing data in AUTO_FLUSH_BACKGROUND mode. 
> This can provide higher throughput for ingest workloads.
> 
> This release also includes many bug fixes, optimizations, and other 
> improvements, detailed in the release notes available at:
> http://kudu.apache.org/releases/1.0.0/docs/release_notes.html 
> 
> 
> Download the source release here:
> http://kudu.apache.org/releases/1.0.0/ 
> 
> 
> Convenience binary artifacts for the Java client and various Java 
> integrations (eg Spark, Flume) are also now available via the ASF Maven 
> repository.
> 
> Enjoy the new release!
> 
> - The Apache Kudu team



Re: JDBC Very Slow

2016-09-16 Thread Benjamin Kim
I am testing this in spark-shell. I am following the Spark documentation by 
simply adding the PostgreSQL driver to the Spark Classpath.

SPARK_CLASSPATH=/path/to/postgresql/driver spark-shell

Then, I run the code below to connect to the PostgreSQL database to query. This 
is when I have problems.

Thanks,
Ben


> On Sep 16, 2016, at 3:29 PM, Nikolay Zhebet <phpap...@gmail.com> wrote:
> 
> Hi! Can you split init code with current comand? I thing it is main problem 
> in your code.
> 
> 16 сент. 2016 г. 8:26 PM пользователь "Benjamin Kim" <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> написал:
> Has anyone using Spark 1.6.2 encountered very slow responses from pulling 
> data from PostgreSQL using JDBC? I can get to the table and see the schema, 
> but when I do a show, it takes very long or keeps timing out.
> 
> The code is simple.
> 
> val jdbcDF = sqlContext.read.format("jdbc").options(
> Map("url" -> 
> "jdbc:postgresql://dbserver:port/database?user=user=password",
>"dbtable" -> “schema.table")).load()
> 
> jdbcDF.show
> 
> If anyone can help, please let me know.
> 
> Thanks,
> Ben
> 



JDBC Very Slow

2016-09-16 Thread Benjamin Kim
Has anyone using Spark 1.6.2 encountered very slow responses from pulling data 
from PostgreSQL using JDBC? I can get to the table and see the schema, but when 
I do a show, it takes very long or keeps timing out.

The code is simple.

val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> 
"jdbc:postgresql://dbserver:port/database?user=user=password",
   "dbtable" -> “schema.table")).load()

jdbcDF.show

If anyone can help, please let me know.

Thanks,
Ben
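
One knob often suggested for slow JDBC pulls, separate from whatever is happening in this particular case, is a partitioned read so that Spark opens several connections instead of one. A sketch against Spark 1.6's DataFrameReader.jdbc, where the partition column and bounds are illustrative:

import java.util.Properties

val props = new Properties()
props.setProperty("user", "user")          // illustrative credentials
props.setProperty("password", "password")

// Splits the scan into numPartitions ranges over a numeric column.
val jdbcDF = sqlContext.read.jdbc(
  "jdbc:postgresql://dbserver:port/database",
  "schema.table",
  "id",        // numeric partition column
  1L,          // lowerBound
  1000000L,    // upperBound
  8,           // numPartitions
  props)

jdbcDF.show()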



Re: Using Spark SQL to Create JDBC Tables

2016-09-13 Thread Benjamin Kim
Thank you for the idea. I will look for a PostgreSQL Serde for Hive. But, if 
you don’t mind me asking, how did you install the Oracle Serde?

Cheers,
Ben


> On Sep 13, 2016, at 7:12 PM, ayan guha <guha.a...@gmail.com> wrote:
> 
> One option is have Hive as the central point of exposing data ie create hive 
> tables which "point to" any other DB. i know Oracle provides there own Serde 
> for hive. Not sure about PG though.
> 
> Once tables are created in hive, STS will automatically see it. 
> 
> On Wed, Sep 14, 2016 at 11:08 AM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Has anyone created tables using Spark SQL that directly connect to a JDBC 
> data source such as PostgreSQL? I would like to use Spark SQL Thriftserver to 
> access and query remote PostgreSQL tables. In this way, we can centralize 
> data access to Spark SQL tables along with PostgreSQL making it very 
> convenient for users. They would not know or care where the data is 
> physically located anymore.
> 
> By the way, our users only know SQL.
> 
> If anyone has a better suggestion, then please let me know too.
> 
> Thanks,
> Ben
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> 
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha



Using Spark SQL to Create JDBC Tables

2016-09-13 Thread Benjamin Kim
Has anyone created tables using Spark SQL that directly connect to a JDBC data 
source such as PostgreSQL? I would like to use Spark SQL Thriftserver to access 
and query remote PostgreSQL tables. In this way, we can centralize data access 
to Spark SQL tables along with PostgreSQL making it very convenient for users. 
They would not know or care where the data is physically located anymore.

By the way, our users only know SQL.

If anyone has a better suggestion, then please let me know too.

Thanks,
Ben
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
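
A minimal sketch of one way to do this in Spark 1.6: register a table backed directly by the JDBC source so the Thrift server can expose it. The table name, URL, and credentials are illustrative, and the same DDL can be run from a beeline session against the Thrift server:

sqlContext.sql(
  """CREATE TEMPORARY TABLE pg_customers
    |USING org.apache.spark.sql.jdbc
    |OPTIONS (
    |  url "jdbc:postgresql://dbserver:port/database?user=user&password=password",
    |  dbtable "schema.customers"
    |)
  """.stripMargin)

// Users can then query pg_customers with plain SQL; CACHE TABLE pg_customers would pin it
// in memory instead of re-reading from PostgreSQL on every query.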



Re: Spark SQL Thriftserver

2016-09-13 Thread Benjamin Kim
Mich,

It sounds like that there would be no harm in changing then. Are you saying 
that using STS would still use MapReduce to run the SQL statements? What our 
users are doing in our CDH 5.7.2 installation is changing the execution engine 
to Spark when connected to HiveServer2 to get faster results. Would they still 
have to do this using STS? Lastly, we are seeing zombie YARN jobs left behind 
even after a user disconnects. Are you seeing this happen with STS? If not, 
then this would be even better.

Thanks for your fast reply.

Cheers,
Ben

> On Sep 13, 2016, at 3:15 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> Hi,
> 
> Spark Thrift server (STS) still uses hive thrift server. If you look at 
> $SPARK_HOME/sbin/start-thriftserver.sh you will see (mine is Spark 2)
> 
> function usage {
>   echo "Usage: ./sbin/start-thriftserver [options] [thrift server options]"
>   pattern="usage"
>   pattern+="\|Spark assembly has been built with Hive"
>   pattern+="\|NOTE: SPARK_PREPEND_CLASSES is set"
>   pattern+="\|Spark Command: "
>   pattern+="\|==="
>   pattern+="\|--help"
> 
> 
> Indeed when you start STS, you pass hiveconf parameter to it
> 
> ${SPARK_HOME}/sbin/start-thriftserver.sh \
> --master  \
> --hiveconf hive.server2.thrift.port=10055 \
> 
> and STS bypasses Spark optimiser and uses Hive optimizer and execution 
> engine. You will see this in hive.log file
> 
> So I don't think it is going to give you much difference. Unless they have 
> recently changed the design of STS.
> 
> HTH
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 13 September 2016 at 22:32, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Does anyone have any thoughts about using Spark SQL Thriftserver in Spark 
> 1.6.2 instead of HiveServer2? We are considering abandoning HiveServer2 for 
> it. Some advice and gotcha’s would be nice to know.
> 
> Thanks,
> Ben
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> 
> 



Spark SQL Thriftserver

2016-09-13 Thread Benjamin Kim
Does anyone have any thoughts about using Spark SQL Thriftserver in Spark 1.6.2 
instead of HiveServer2? We are considering abandoning HiveServer2 for it. Some 
advice and gotcha’s would be nice to know.

Thanks,
Ben
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark Metrics: custom source/sink configurations not getting recognized

2016-09-07 Thread Benjamin Kim
We use Graphite/Grafana for custom metrics. We found Spark’s metrics not to be 
customizable. So, we write directly using Graphite’s API, which was very easy 
to do using Java’s socket library in Scala. It works great for us, and we are 
going one step further using Sensu to alert us if there is an anomaly in the 
metrics beyond the norm.
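
As a rough sketch of that approach (Graphite's plaintext protocol is just "metric.path value epoch-seconds" sent to Carbon, usually on port 2003; the host and metric names are illustrative, and real code would reuse connections and handle failures):

import java.io.PrintWriter
import java.net.Socket

def sendToGraphite(host: String, port: Int, path: String, value: Double): Unit = {
  val socket = new Socket(host, port)
  try {
    val out = new PrintWriter(socket.getOutputStream, true)
    out.println(s"$path $value ${System.currentTimeMillis / 1000}")
  } finally {
    socket.close()
  }
}

// e.g. sendToGraphite("graphite.example.com", 2003, "spark.myapp.records_processed", 1234)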

Hope this helps.

Cheers,
Ben


> On Sep 6, 2016, at 9:52 PM, map reduced  wrote:
> 
> Hi, anyone has any ideas please?
> 
> On Mon, Sep 5, 2016 at 8:30 PM, map reduced  > wrote:
> Hi,
> 
> I've written my custom metrics source/sink for my Spark streaming app and I 
> am trying to initialize it from metrics.properties - but that doesn't work 
> from executors. I don't have control on the machines in Spark cluster, so I 
> can't copy properties file in $SPARK_HOME/conf/ in the cluster. I have it in 
> the fat jar where my app lives, but by the time my fat jar is downloaded on 
> worker nodes in cluster, executors are already started and their Metrics 
> system is already initialized - thus not picking my file with custom source 
> configuration in it.
> 
> Following this post 
> ,
>  I've specified 'spark.files 
>  = 
> metrics.properties' and 'spark.metrics.conf=metrics.properties' but by the 
> time 'metrics.properties' is shipped to executors, their metric system is 
> already initialized.
> 
> If I initialize my own metrics system, it's picking up my file but then I'm 
> missing master/executor level metrics/properties (eg. 
> executor.sink.mySink.propName=myProp - can't read 'propName' from 'mySink') 
> since they are initialized 
> 
>  by Spark's metric system.
> 
> Is there a (programmatic) way to have 'metrics.properties' shipped before 
> executors initialize 
> 
>  ?
> 
> Here's my SO question 
> .
> 
> Thanks,
> 
> KP
> 
> 



Re: Spark SQL Tables on top of HBase Tables

2016-09-03 Thread Benjamin Kim
I’m using Spark 1.6 and HBase 1.2. Have you got it to work using these versions?

> On Sep 3, 2016, at 12:49 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> I am trying to find a solution for this
> 
> ERROR log: error in initSerDe: java.lang.ClassNotFoundException Class 
> org.apache.hadoop.hive.hbase.HBaseSerDe not found
> 
> I am using Spark 2 and Hive 2!
> 
> HTH
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 3 September 2016 at 20:31, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Mich,
> 
> I’m in the same boat. We can use Hive but not Spark.
> 
> Cheers,
> Ben
> 
>> On Sep 2, 2016, at 3:37 PM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> Hi,
>> 
>> You can create Hive external  tables on top of existing Hbase table using 
>> the property
>> 
>> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>> 
>> Example
>> 
>> hive> show create table hbase_table;
>> OK
>> CREATE TABLE `hbase_table`(
>>   `key` int COMMENT '',
>>   `value1` string COMMENT '',
>>   `value2` int COMMENT '',
>>   `value3` int COMMENT '')
>> ROW FORMAT SERDE
>>   'org.apache.hadoop.hive.hbase.HBaseSerDe'
>> STORED BY
>>   'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>> WITH SERDEPROPERTIES (
>>   'hbase.columns.mapping'=':key,a:b,a:c,d:e',
>>   'serialization.format'='1')
>> TBLPROPERTIES (
>>   'transient_lastDdlTime'='1472370939')
>> 
>>  Then try to access this Hive table from Spark which is giving me grief at 
>> the moment :(
>> 
>> scala> HiveContext.sql("use test")
>> res9: org.apache.spark.sql.DataFrame = []
>> scala> val hbase_table= spark.table("hbase_table")
>> 16/09/02 23:31:07 ERROR log: error in initSerDe: 
>> java.lang.ClassNotFoundException Class 
>> org.apache.hadoop.hive.hbase.HBaseSerDe not found
>> 
>> HTH
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> On 2 September 2016 at 23:08, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com 
>> <mailto:mdkhajaasm...@gmail.com>> wrote:
>> Hi Kim,
>> 
>> I am also looking for same information. Just got the same requirement today.
>> 
>> Thanks,
>> Asmath
>> 
>> On Fri, Sep 2, 2016 at 4:46 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> I was wondering if anyone has tried to create Spark SQL tables on top of 
>> HBase tables so that data in HBase can be accessed using Spark Thriftserver 
>> with SQL statements? This is similar what can be done using Hive.
>> 
>> Thanks,
>> Ben
>> 
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
>> <mailto:user-unsubscr...@spark.apache.org>
>> 
>> 
>> 
> 
> 



Re: Spark SQL Tables on top of HBase Tables

2016-09-03 Thread Benjamin Kim
Mich,

I’m in the same boat. We can use Hive but not Spark.

Cheers,
Ben

> On Sep 2, 2016, at 3:37 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> Hi,
> 
> You can create Hive external  tables on top of existing Hbase table using the 
> property
> 
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> 
> Example
> 
> hive> show create table hbase_table;
> OK
> CREATE TABLE `hbase_table`(
>   `key` int COMMENT '',
>   `value1` string COMMENT '',
>   `value2` int COMMENT '',
>   `value3` int COMMENT '')
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.hbase.HBaseSerDe'
> STORED BY
>   'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES (
>   'hbase.columns.mapping'=':key,a:b,a:c,d:e',
>   'serialization.format'='1')
> TBLPROPERTIES (
>   'transient_lastDdlTime'='1472370939')
> 
>  Then try to access this Hive table from Spark which is giving me grief at 
> the moment :(
> 
> scala> HiveContext.sql("use test")
> res9: org.apache.spark.sql.DataFrame = []
> scala> val hbase_table= spark.table("hbase_table")
> 16/09/02 23:31:07 ERROR log: error in initSerDe: 
> java.lang.ClassNotFoundException Class 
> org.apache.hadoop.hive.hbase.HBaseSerDe not found
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 2 September 2016 at 23:08, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com 
> <mailto:mdkhajaasm...@gmail.com>> wrote:
> Hi Kim,
> 
> I am also looking for same information. Just got the same requirement today.
> 
> Thanks,
> Asmath
> 
> On Fri, Sep 2, 2016 at 4:46 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> I was wondering if anyone has tried to create Spark SQL tables on top of 
> HBase tables so that data in HBase can be accessed using Spark Thriftserver 
> with SQL statements? This is similar what can be done using Hive.
> 
> Thanks,
> Ben
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> 
> 
> 



Spark SQL Tables on top of HBase Tables

2016-09-02 Thread Benjamin Kim
I was wondering if anyone has tried to create Spark SQL tables on top of HBase 
tables so that data in HBase can be accessed using Spark Thriftserver with SQL 
statements? This is similar to what can be done using Hive.

Thanks,
Ben


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
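
For what it's worth, once a Hive external table is mapped onto HBase (as in the DDL quoted in the replies above), querying it from Spark is a plain HiveContext query; the usual sticking point is getting hive-hbase-handler, the HBase client jars, and htrace onto the Spark classpath. A sketch, with the table and column names following that example:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("use test")
val df = hiveContext.sql("SELECT key, value1, value2, value3 FROM hbase_table")
df.show()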



Spark 1.6 Streaming with Checkpointing

2016-08-26 Thread Benjamin Kim
I am trying to implement checkpointing in my streaming application but I am 
getting a not serializable error. Has anyone encountered this? I am deploying 
this job in YARN clustered mode.

Here is a snippet of the main parts of the code.

object S3EventIngestion {
  // create and set up the streaming context
  def createContext(
      batchInterval: Integer, checkpointDirectory: String, awsS3BucketName: String,
      databaseName: String, tableName: String, partitionByColumnName: String
  ): StreamingContext = {

    println("Creating new context")
    val sparkConf = new SparkConf().setAppName("S3EventIngestion")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)

    // Create the streaming context with batch interval
    val ssc = new StreamingContext(sc, Seconds(batchInterval))

    // Create a text file stream on an S3 bucket
    val csv = ssc.textFileStream("s3a://" + awsS3BucketName + "/")

    csv.foreachRDD(rdd => {
      if (!rdd.partitions.isEmpty) {
        // process data
      }
    })

    ssc.checkpoint(checkpointDirectory)
    ssc
  }

  def main(args: Array[String]) {
    if (args.length != 6) {
      System.err.println("Usage: S3EventIngestion <batchInterval> <checkpointDirectory> " +
        "<awsS3BucketName> <databaseName> <tableName> <partitionByColumnName>")
      System.exit(1)
    }

    val Array(interval, checkpoint, bucket, database, table, partitionBy) = args

    // Get streaming context from checkpoint data or create a new one
    val context = StreamingContext.getOrCreate(checkpoint,
      () => createContext(interval.toInt, checkpoint, bucket, database, table, partitionBy))

    // start streaming context
    context.start()
    context.awaitTermination()
  }
}

Can someone help please?

Thanks,
Ben
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
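
On the not-serializable error: when checkpointing is enabled, the DStream graph, including the closures passed to foreachRDD, is serialized into the checkpoint, and driver-side objects such as the SQLContext created in createContext are a common culprit. One commonly suggested pattern is to look the SQLContext up lazily inside foreachRDD; a sketch of just that part (the actual processing is elided above):

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.StreamingContext

def wireStream(ssc: StreamingContext, awsS3BucketName: String): Unit = {
  val csv = ssc.textFileStream("s3a://" + awsS3BucketName + "/")
  csv.foreachRDD { rdd =>
    if (!rdd.partitions.isEmpty) {
      // Fetched per batch from the RDD's SparkContext, so no SQLContext instance
      // is captured in the checkpointed closure.
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      // ... process data with sqlContext ...
    }
  }
}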



HBase-Spark Module

2016-07-29 Thread Benjamin Kim
I would like to know if anyone has tried using the hbase-spark module? I tried 
to follow the examples in conjunction with CDH 5.8.0. I cannot find the 
HBaseTableCatalog class in the module or in any of the Spark jars. Can someone 
help?

Thanks,
Ben
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Pass Credentials through JDBC

2016-07-28 Thread Benjamin Kim
Thank you. I’ll take a look.


> On Jul 28, 2016, at 8:16 AM, Jongyoul Lee <jongy...@gmail.com> wrote:
> 
> You can find more information on 
> https://issues.apache.org/jira/browse/ZEPPELIN-1146 
> <https://issues.apache.org/jira/browse/ZEPPELIN-1146>
> 
> Hope this help,
> Jongyoul
> 
> On Fri, Jul 29, 2016 at 12:08 AM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Hi Jonyoul,
> 
> How would I enter credentials with the current version of Zeppelin? Do you 
> know of a way to make it work now?
> 
> Thanks,
> Ben
> 
>> On Jul 28, 2016, at 8:06 AM, Jongyoul Lee <jongy...@gmail.com 
>> <mailto:jongy...@gmail.com>> wrote:
>> 
>> Hi,
>> 
>> In my plan, this is a next step after 
>> https://issues.apache.org/jira/browse/ZEPPELIN-1210 
>> <https://issues.apache.org/jira/browse/ZEPPELIN-1210>. But for now, there's 
>> no way to pass your credentials with hiding them. I hope that would be 
>> included in 0.7.0.
>> 
>> Regards,
>> Jongyoul
>> 
>> On Thu, Jul 28, 2016 at 11:22 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> How do I pass username and password to JDBC connections such as Phoenix and 
>> Hive that are my own? Can my credentials be passed from Shiro after logging 
>> in? Or do I have to set them at the Interpreter level without sharing them? 
>> I wish there was more information on this.
>> 
>> Thanks,
>> Ben
>> 
>> 
>> 
>> -- 
>> 이종열, Jongyoul Lee, 李宗烈
>> http://madeng.net <http://madeng.net/>
> 
> 
> 
> 
> -- 
> 이종열, Jongyoul Lee, 李宗烈
> http://madeng.net <http://madeng.net/>



Re: How to connect HBase and Spark using Python?

2016-07-22 Thread Benjamin Kim
It is included in Cloudera’s CDH 5.8.

> On Jul 22, 2016, at 6:13 PM, Mail.com  wrote:
> 
> Hbase Spark module will be available with Hbase 2.0. Is that out yet?
> 
>> On Jul 22, 2016, at 8:50 PM, Def_Os  wrote:
>> 
>> So it appears it should be possible to use HBase's new hbase-spark module, if
>> you follow this pattern:
>> https://hbase.apache.org/book.html#_sparksql_dataframes
>> 
>> Unfortunately, when I run my example from PySpark, I get the following
>> exception:
>> 
>> 
>>> py4j.protocol.Py4JJavaError: An error occurred while calling o120.save.
>>> : java.lang.RuntimeException: org.apache.hadoop.hbase.spark.DefaultSource
>>> does not allow create table as select.
>>>   at scala.sys.package$.error(package.scala:27)
>>>   at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:259)
>>>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>   at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>   at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>   at java.lang.reflect.Method.invoke(Method.java:606)
>>>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>>>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
>>>   at py4j.Gateway.invoke(Gateway.java:259)
>>>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>>>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>>   at py4j.GatewayConnection.run(GatewayConnection.java:209)
>>>   at java.lang.Thread.run(Thread.java:745)
>> 
>> Even when I created the table in HBase first, it still failed.
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-connect-HBase-and-Spark-using-Python-tp27372p27397.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> 
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: transtition SQLContext to SparkSession

2016-07-18 Thread Benjamin Kim
From what I read, there are no more separate Contexts.

"SparkContext, SQLContext, HiveContext merged into SparkSession"

I have not tested it, but I don’t know if it’s true.

Cheers,
Ben


> On Jul 18, 2016, at 8:37 AM, Koert Kuipers  wrote:
> 
> in my codebase i would like to gradually transition to SparkSession, so while 
> i start using SparkSession i also want a SQLContext to be available as before 
> (but with a deprecated warning when i use it). this should be easy since 
> SQLContext is now a wrapper for SparkSession.
> 
> so basically:
> val session = SparkSession.builder.set(..., ...).getOrCreate()
> val sqlc = new SQLContext(session)
> 
> however this doesnt work, the SQLContext constructor i am trying to use is 
> private. SparkSession.sqlContext is also private.
> 
> am i missing something?
> 
> a non-gradual switch is not very realistic in any significant codebase, and i 
> do not want to create SparkSession and SQLContext independendly (both from 
> same SparkContext) since that can only lead to confusion and inconsistent 
> settings.
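
For reference, a minimal sketch of the Spark 2.0 entry point being discussed (names per the 2.0 API; how much of the old SQLContext surface is publicly reachable may depend on the exact 2.0 build):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("migration-sketch")
  .enableHiveSupport()   // Hive-backed catalog, roughly what HiveContext used to provide
  .getOrCreate()

val df = spark.sql("SELECT 1 AS one")
df.show()

// The underlying SparkContext is still reachable for older APIs.
val sc = spark.sparkContext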



Re: Performance Question

2016-07-18 Thread Benjamin Kim
Todd,

I upgraded, deleted the table and recreated it again because it was 
unaccessible, and re-introduced the downed tablet server after clearing out all 
kudu directories.

The Spark Streaming job is repopulating again.

Thanks,
Ben


> On Jul 18, 2016, at 10:32 AM, Todd Lipcon <t...@cloudera.com> wrote:
> 
> On Mon, Jul 18, 2016 at 10:31 AM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Todd,
> 
> Thanks for the info. I was going to upgrade after the testing, but now, it 
> looks like I will have to do it earlier than expected.
> 
> I will do the upgrade, then resume.
> 
> OK, sounds good. The upgrade shouldn't invalidate any performance testing or 
> anything -- just fixes this important bug.
> 
> -Todd
> 
> 
>> On Jul 18, 2016, at 10:29 AM, Todd Lipcon <t...@cloudera.com 
>> <mailto:t...@cloudera.com>> wrote:
>> 
>> Hi Ben,
>> 
>> Any chance that you are running Kudu 0.9.0 instead of 0.9.1? There's a known 
>> serious bug in 0.9.0 which can cause this kind of corruption.
>> 
>> Assuming that you are running with replication count 3 this time, you should 
>> be able to move aside that tablet metadata file and start the server. It 
>> will recreate a new repaired replica automatically.
>> 
>> -Todd
>> 
>> On Mon, Jul 18, 2016 at 10:28 AM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> During my re-population of the Kudu table, I am getting this error trying to 
>> restart a tablet server after it went down. The job that populates this 
>> table has been running for over a week.
>> 
>> [libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message 
>> of type "kudu.tablet.TabletSuperBlockPB" because it is missing required 
>> fields: rowsets[2324].columns[15].block
>> F0718 17:01:26.783571   468 tablet_server_main.cc:55] Check failed: _s.ok() 
>> Bad status: IO error: Could not init Tablet Manager: Failed to open tablet 
>> metadata for tablet: 24637ee6f3e5440181ce3f20b1b298ba: Failed to load tablet 
>> metadata for tablet id 24637ee6f3e5440181ce3f20b1b298ba: Could not load 
>> tablet metadata from 
>> /mnt/data1/kudu/data/tablet-meta/24637ee6f3e5440181ce3f20b1b298ba: Unable to 
>> parse PB from path: 
>> /mnt/data1/kudu/data/tablet-meta/24637ee6f3e5440181ce3f20b1b298ba
>> *** Check failure stack trace: ***
>> @   0x7d794d  google::LogMessage::Fail()
>> @   0x7d984d  google::LogMessage::SendToLog()
>> @   0x7d7489  google::LogMessage::Flush()
>> @   0x7da2ef  google::LogMessageFatal::~LogMessageFatal()
>> @   0x78172b  (unknown)
>> @   0x344d41ed5d  (unknown)
>> @   0x7811d1  (unknown)
>> 
>> Does anyone know what this means?
>> 
>> Thanks,
>> Ben
>> 
>> 
>>> On Jul 11, 2016, at 10:47 AM, Todd Lipcon <t...@cloudera.com 
>>> <mailto:t...@cloudera.com>> wrote:
>>> 
>>> On Mon, Jul 11, 2016 at 10:40 AM, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> Todd,
>>> 
>>> I had it at one replica. Do I have to recreate?
>>> 
>>> We don't currently have the ability to "accept data loss" on a tablet (or 
>>> set of tablets). If the machine is gone for good, then currently the only 
>>> easy way to recover is to recreate the table. If this sounds really 
>>> painful, though, maybe we can work up some kind of tool you could use to 
>>> just recreate the missing tablets (with those rows lost).
>>> 
>>> -Todd
>>> 
>>>> On Jul 11, 2016, at 10:37 AM, Todd Lipcon <t...@cloudera.com 
>>>> <mailto:t...@cloudera.com>> wrote:
>>>> 
>>>> Hey Ben,
>>>> 
>>>> Is the table that you're querying replicated? Or was it created with only 
>>>> one replica per tablet?
>>>> 
>>>> -Todd
>>>> 
>>>> On Mon, Jul 11, 2016 at 10:35 AM, Benjamin Kim <b...@amobee.com 
>>>> <mailto:b...@amobee.com>> wrote:
>>>> Over the weekend, a tablet server went down. It’s not coming back up. So, 
>>>> I decommissioned it and removed it from the cluster. Then, I restarted 
>>>> Kudu because I was getting a timeout  exception trying to do counts on the 
>>>> table. Now, when I try again. I get the same error.
>>>> 
>>>> 16/07/11 17:32:36 WARN scheduler.TaskSetManager: Lost task 46

Re: Performance Question

2016-07-18 Thread Benjamin Kim
Todd,

Thanks for the info. I was going to upgrade after the testing, but now, it 
looks like I will have to do it earlier than expected.

I will do the upgrade, then resume.

Cheers,
Ben


> On Jul 18, 2016, at 10:29 AM, Todd Lipcon <t...@cloudera.com> wrote:
> 
> Hi Ben,
> 
> Any chance that you are running Kudu 0.9.0 instead of 0.9.1? There's a known 
> serious bug in 0.9.0 which can cause this kind of corruption.
> 
> Assuming that you are running with replication count 3 this time, you should 
> be able to move aside that tablet metadata file and start the server. It will 
> recreate a new repaired replica automatically.
> 
> -Todd
> 
> On Mon, Jul 18, 2016 at 10:28 AM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> During my re-population of the Kudu table, I am getting this error trying to 
> restart a tablet server after it went down. The job that populates this table 
> has been running for over a week.
> 
> [libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message 
> of type "kudu.tablet.TabletSuperBlockPB" because it is missing required 
> fields: rowsets[2324].columns[15].block
> F0718 17:01:26.783571   468 tablet_server_main.cc:55] Check failed: _s.ok() 
> Bad status: IO error: Could not init Tablet Manager: Failed to open tablet 
> metadata for tablet: 24637ee6f3e5440181ce3f20b1b298ba: Failed to load tablet 
> metadata for tablet id 24637ee6f3e5440181ce3f20b1b298ba: Could not load 
> tablet metadata from 
> /mnt/data1/kudu/data/tablet-meta/24637ee6f3e5440181ce3f20b1b298ba: Unable to 
> parse PB from path: 
> /mnt/data1/kudu/data/tablet-meta/24637ee6f3e5440181ce3f20b1b298ba
> *** Check failure stack trace: ***
> @   0x7d794d  google::LogMessage::Fail()
> @   0x7d984d  google::LogMessage::SendToLog()
> @   0x7d7489  google::LogMessage::Flush()
> @   0x7da2ef  google::LogMessageFatal::~LogMessageFatal()
> @   0x78172b  (unknown)
> @   0x344d41ed5d  (unknown)
> @   0x7811d1  (unknown)
> 
> Does anyone know what this means?
> 
> Thanks,
> Ben
> 
> 
>> On Jul 11, 2016, at 10:47 AM, Todd Lipcon <t...@cloudera.com 
>> <mailto:t...@cloudera.com>> wrote:
>> 
>> On Mon, Jul 11, 2016 at 10:40 AM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Todd,
>> 
>> I had it at one replica. Do I have to recreate?
>> 
>> We don't currently have the ability to "accept data loss" on a tablet (or 
>> set of tablets). If the machine is gone for good, then currently the only 
>> easy way to recover is to recreate the table. If this sounds really painful, 
>> though, maybe we can work up some kind of tool you could use to just 
>> recreate the missing tablets (with those rows lost).
>> 
>> -Todd
>> 
>>> On Jul 11, 2016, at 10:37 AM, Todd Lipcon <t...@cloudera.com 
>>> <mailto:t...@cloudera.com>> wrote:
>>> 
>>> Hey Ben,
>>> 
>>> Is the table that you're querying replicated? Or was it created with only 
>>> one replica per tablet?
>>> 
>>> -Todd
>>> 
>>> On Mon, Jul 11, 2016 at 10:35 AM, Benjamin Kim <b...@amobee.com 
>>> <mailto:b...@amobee.com>> wrote:
>>> Over the weekend, a tablet server went down. It’s not coming back up. So, I 
>>> decommissioned it and removed it from the cluster. Then, I restarted Kudu 
>>> because I was getting a timeout exception trying to do counts on the 
>>> table. Now, when I try again, I get the same error.
>>> 
>>> 16/07/11 17:32:36 WARN scheduler.TaskSetManager: Lost task 468.3 in stage 
>>> 0.0 (TID 603, prod-dc1-datanode167.pdc1i.gradientx.com 
>>> <http://prod-dc1-datanode167.pdc1i.gradientx.com/>): 
>>> com.stumbleupon.async.TimeoutException: Timed out after 3ms when 
>>> joining Deferred@712342716(state=PAUSED, result=Deferred@1765902299, 
>>> callback=passthrough -> scanner opened -> wakeup thread Executor task 
>>> launch worker-2, errback=openScanner errback -> passthrough -> wakeup 
>>> thread Executor task launch worker-2)
>>> at com.stumbleupon.async.Deferred.doJoin(Deferred.java:1177)
>>> at com.stumbleupon.async.Deferred.join(Deferred.java:1045)
>>> at org.kududb.client.KuduScanner.nextRows(KuduScanner.java:57)
>>> at org.kududb.spark.kudu.RowResultIteratorScala.hasNext(KuduRDD.scala:99)
>>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>> at 
>>> org.apache.spark.sql.execu

Re: Performance Question

2016-07-18 Thread Benjamin Kim
During my re-population of the Kudu table, I am getting this error trying to 
restart a tablet server after it went down. The job that populates this table 
has been running for over a week.

[libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message of 
type "kudu.tablet.TabletSuperBlockPB" because it is missing required fields: 
rowsets[2324].columns[15].block
F0718 17:01:26.783571   468 tablet_server_main.cc:55] Check failed: _s.ok() Bad 
status: IO error: Could not init Tablet Manager: Failed to open tablet metadata 
for tablet: 24637ee6f3e5440181ce3f20b1b298ba: Failed to load tablet metadata 
for tablet id 24637ee6f3e5440181ce3f20b1b298ba: Could not load tablet metadata 
from /mnt/data1/kudu/data/tablet-meta/24637ee6f3e5440181ce3f20b1b298ba: Unable 
to parse PB from path: 
/mnt/data1/kudu/data/tablet-meta/24637ee6f3e5440181ce3f20b1b298ba
*** Check failure stack trace: ***
@   0x7d794d  google::LogMessage::Fail()
@   0x7d984d  google::LogMessage::SendToLog()
@   0x7d7489  google::LogMessage::Flush()
@   0x7da2ef  google::LogMessageFatal::~LogMessageFatal()
@   0x78172b  (unknown)
@   0x344d41ed5d  (unknown)
@   0x7811d1  (unknown)

Does anyone know what this means?

Thanks,
Ben


> On Jul 11, 2016, at 10:47 AM, Todd Lipcon <t...@cloudera.com> wrote:
> 
> On Mon, Jul 11, 2016 at 10:40 AM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Todd,
> 
> I had it at one replica. Do I have to recreate?
> 
> We don't currently have the ability to "accept data loss" on a tablet (or set 
> of tablets). If the machine is gone for good, then currently the only easy 
> way to recover is to recreate the table. If this sounds really painful, 
> though, maybe we can work up some kind of tool you could use to just recreate 
> the missing tablets (with those rows lost).
> 
> -Todd
> 
>> On Jul 11, 2016, at 10:37 AM, Todd Lipcon <t...@cloudera.com 
>> <mailto:t...@cloudera.com>> wrote:
>> 
>> Hey Ben,
>> 
>> Is the table that you're querying replicated? Or was it created with only 
>> one replica per tablet?
>> 
>> -Todd
>> 
>> On Mon, Jul 11, 2016 at 10:35 AM, Benjamin Kim <b...@amobee.com 
>> <mailto:b...@amobee.com>> wrote:
>> Over the weekend, a tablet server went down. It’s not coming back up. So, I 
>> decommissioned it and removed it from the cluster. Then, I restarted Kudu 
>> because I was getting a timeout exception trying to do counts on the table. 
>> Now, when I try again, I get the same error.
>> 
>> 16/07/11 17:32:36 WARN scheduler.TaskSetManager: Lost task 468.3 in stage 
>> 0.0 (TID 603, prod-dc1-datanode167.pdc1i.gradientx.com 
>> <http://prod-dc1-datanode167.pdc1i.gradientx.com/>): 
>> com.stumbleupon.async.TimeoutException: Timed out after 3ms when joining 
>> Deferred@712342716(state=PAUSED, result=Deferred@1765902299, 
>> callback=passthrough -> scanner opened -> wakeup thread Executor task launch 
>> worker-2, errback=openScanner errback -> passthrough -> wakeup thread 
>> Executor task launch worker-2)
>> at com.stumbleupon.async.Deferred.doJoin(Deferred.java:1177)
>> at com.stumbleupon.async.Deferred.join(Deferred.java:1045)
>> at org.kududb.client.KuduScanner.nextRows(KuduScanner.java:57)
>> at org.kududb.spark.kudu.RowResultIteratorScala.hasNext(KuduRDD.scala:99)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> at 
>> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
>> at 
>> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>> at 
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>> at 
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>> at org.apache.spark.executor.Executor$TaskRunner.ru

Re: Spark Website

2016-07-13 Thread Benjamin Kim
It takes me to the directories instead of the webpage.

> On Jul 13, 2016, at 11:45 AM, manish ranjan <cse1.man...@gmail.com> wrote:
> 
> working for me. What do you mean 'as supposed to'?
> 
> ~Manish
> 
> 
> 
> On Wed, Jul 13, 2016 at 11:45 AM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Has anyone noticed that the spark.apache.org <http://spark.apache.org/> is 
> not working as supposed to?
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> 
> 



Spark Website

2016-07-13 Thread Benjamin Kim
Has anyone noticed that the spark.apache.org is not working as supposed to?


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Zeppelin 0.6.0 on CDH 5.7.1

2016-07-12 Thread Benjamin Kim
 with Spark 2.0.
> 
> -users@ for now
> 
> This error seems to be serialization related. Commonly this can be caused by 
> mismatched versions. What is spark.master set to? Could you try with local[*] 
> instead of yarn-client to see if the Spark run by Zeppelin is somehow 
> different?
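
A quick way to see what the Zeppelin Spark interpreter is actually running is to
print the master, the Spark version, and the Scala version from a %spark
paragraph. A minimal sketch; sc is the SparkContext the interpreter provides,
and nothing beyond the standard accessors is assumed.

  %spark
  // sc is the SparkContext injected by the Zeppelin Spark interpreter.
  println("master  = " + sc.master)                      // e.g. yarn-client or local[*]
  println("version = " + sc.version)                     // should match the cluster's Spark build
  println("scala   = " + util.Properties.versionString)  // serialization errors often trace back to a Scala mismatch
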
> 
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>  at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>  at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
>  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:64)
> 
> 
> _
> From: Benjamin Kim <bbuil...@gmail.com <mailto:bbuil...@gmail.com>>
> Sent: Saturday, July 9, 2016 10:54 PM
> Subject: Re: [ANNOUNCE] Apache Zeppelin 0.6.0 released
> To: <us...@zeppelin.apache.org <mailto:us...@zeppelin.apache.org>>
> Cc: <dev@zeppelin.apache.org <mailto:dev@zeppelin.apache.org>>
> 
> 
> Hi JL,
> 
> Spark is version 1.6.0 and Akka is 2.2.3. But Cloudera always backports 
> things from newer versions. They told me that they ported some bug fixes from 
> Spark 2.0.
> 
> Please let me know if you need any more information.
> 
> Cheers,
> Ben
> 
> 
> On Jul 9, 2016, at 10:12 PM, Jongyoul Lee <jongy...@gmail.com 
> <mailto:jongy...@gmail.com>> wrote:
> 
> Hi all,
> 
> Could you guys check the CDH version of Spark? As I tested it a long time 
> ago, it is a little bit different from the vanilla one; for example, the 
> CDH one has different versions of some dependencies, including Akka.
> 
> Regards,
> JL
> 
> On Sat, Jul 9, 2016 at 11:47 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Felix,
> 
> I added hive-site.xml to the conf directory and restarted Zeppelin. Now, I 
> get another error:
> 
> java.lang.ClassNotFoundException: 
> line1631424043$24.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
> at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
> at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>

Re: Performance Question

2016-07-11 Thread Benjamin Kim
Todd,

It’s no problem to start over again, but a tool like that would be helpful. 
Gaps in the data can be accommodated by just backfilling.

Thanks,
Ben
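
Until such a tool exists, recreating the table before backfilling can be
scripted against the Java client directly. A rough sketch in Scala against the
0.9.x org.kududb client follows; the two-column schema, table name, and
hash-partitioning choice are made-up examples, and the builder method names
should be double-checked against the client javadoc for the version in use.

  import org.kududb.ColumnSchema.ColumnSchemaBuilder
  import org.kududb.{Schema, Type}
  import org.kududb.client.{CreateTableOptions, KuduClient}
  import scala.collection.JavaConverters._

  val client = new KuduClient.KuduClientBuilder("kudu-master.example.com:7051").build()

  // Hypothetical schema -- replace with the real columns before backfilling.
  val columns = List(
    new ColumnSchemaBuilder("id", Type.STRING).key(true).build(),
    new ColumnSchemaBuilder("value", Type.INT64).build()
  ).asJava
  val schema = new Schema(columns)

  // Three replicas this time, so losing a single tablet server is survivable.
  val options = new CreateTableOptions()
    .setNumReplicas(3)
    .addHashPartitions(List("id").asJava, 16)

  client.createTable("my_table", schema, options)
  client.shutdown()
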

> On Jul 11, 2016, at 10:47 AM, Todd Lipcon <t...@cloudera.com> wrote:
> 
> On Mon, Jul 11, 2016 at 10:40 AM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Todd,
> 
> I had it at one replica. Do I have to recreate?
> 
> We don't currently have the ability to "accept data loss" on a tablet (or set 
> of tablets). If the machine is gone for good, then currently the only easy 
> way to recover is to recreate the table. If this sounds really painful, 
> though, maybe we can work up some kind of tool you could use to just recreate 
> the missing tablets (with those rows lost).
> 
> -Todd
> 
>> On Jul 11, 2016, at 10:37 AM, Todd Lipcon <t...@cloudera.com 
>> <mailto:t...@cloudera.com>> wrote:
>> 
>> Hey Ben,
>> 
>> Is the table that you're querying replicated? Or was it created with only 
>> one replica per tablet?
>> 
>> -Todd
>> 
>> On Mon, Jul 11, 2016 at 10:35 AM, Benjamin Kim <b...@amobee.com 
>> <mailto:b...@amobee.com>> wrote:
>> Over the weekend, a tablet server went down. It’s not coming back up. So, I 
>> decommissioned it and removed it from the cluster. Then, I restarted Kudu 
>> because I was getting a timeout exception trying to do counts on the table. 
>> Now, when I try again, I get the same error.
>> 
>> 16/07/11 17:32:36 WARN scheduler.TaskSetManager: Lost task 468.3 in stage 
>> 0.0 (TID 603, prod-dc1-datanode167.pdc1i.gradientx.com 
>> <http://prod-dc1-datanode167.pdc1i.gradientx.com/>): 
>> com.stumbleupon.async.TimeoutException: Timed out after 3ms when joining 
>> Deferred@712342716(state=PAUSED, result=Deferred@1765902299, 
>> callback=passthrough -> scanner opened -> wakeup thread Executor task launch 
>> worker-2, errback=openScanner errback -> passthrough -> wakeup thread 
>> Executor task launch worker-2)
>> at com.stumbleupon.async.Deferred.doJoin(Deferred.java:1177)
>> at com.stumbleupon.async.Deferred.join(Deferred.java:1045)
>> at org.kududb.client.KuduScanner.nextRows(KuduScanner.java:57)
>> at org.kududb.spark.kudu.RowResultIteratorScala.hasNext(KuduRDD.scala:99)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> at 
>> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
>> at 
>> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>> at 
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>> at 
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> 
>> Does anyone know how to recover from this?
>> 
>> Thanks,
>> Benjamin Kim
>> Data Solutions Architect
>> 
>> [a•mo•bee] (n.) the company defining digital marketing.
>> 
>> Mobile: +1 818 635 2900 <tel:%2B1%20818%20635%202900>
>> 3250 Ocean Park Blvd, Suite 200  |  Santa Monica, CA 90405  |  
>> www.amobee.com <http://www.amobee.com/>
>>> On Jul 6, 2016, at 9:46 AM, Dan Burkert <d...@cloudera.com 
>>> <mailto:d...@cloudera.com>> wrote:
>>> 
>>> 
>>> 
>>> On Wed, Jul 6, 2016 at 7:05 AM, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> Over the weekend, the row count is up to <500M. I will give it another few 
>>> days to get to 1B rows. I sti

Re: Performance Question

2016-07-11 Thread Benjamin Kim
Todd,

I had it at one replica. Do I have to recreate?

Thanks,
Ben


> On Jul 11, 2016, at 10:37 AM, Todd Lipcon <t...@cloudera.com> wrote:
> 
> Hey Ben,
> 
> Is the table that you're querying replicated? Or was it created with only one 
> replica per tablet?
> 
> -Todd
> 
> On Mon, Jul 11, 2016 at 10:35 AM, Benjamin Kim <b...@amobee.com 
> <mailto:b...@amobee.com>> wrote:
> Over the weekend, a tablet server went down. It’s not coming back up. So, I 
> decommissioned it and removed it from the cluster. Then, I restarted Kudu 
> because I was getting a timeout exception trying to do counts on the table. 
> Now, when I try again, I get the same error.
> 
> 16/07/11 17:32:36 WARN scheduler.TaskSetManager: Lost task 468.3 in stage 0.0 
> (TID 603, prod-dc1-datanode167.pdc1i.gradientx.com 
> <http://prod-dc1-datanode167.pdc1i.gradientx.com/>): 
> com.stumbleupon.async.TimeoutException: Timed out after 3ms when joining 
> Deferred@712342716(state=PAUSED, result=Deferred@1765902299, 
> callback=passthrough -> scanner opened -> wakeup thread Executor task launch 
> worker-2, errback=openScanner errback -> passthrough -> wakeup thread 
> Executor task launch worker-2)
> at com.stumbleupon.async.Deferred.doJoin(Deferred.java:1177)
> at com.stumbleupon.async.Deferred.join(Deferred.java:1045)
> at org.kududb.client.KuduScanner.nextRows(KuduScanner.java:57)
> at org.kududb.spark.kudu.RowResultIteratorScala.hasNext(KuduRDD.scala:99)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 
> Does anyone know how to recover from this?
> 
> Thanks,
> Benjamin Kim
> Data Solutions Architect
> 
> [a•mo•bee] (n.) the company defining digital marketing.
> 
> Mobile: +1 818 635 2900 <tel:%2B1%20818%20635%202900>
> 3250 Ocean Park Blvd, Suite 200  |  Santa Monica, CA 90405  |  www.amobee.com 
> <http://www.amobee.com/>
>> On Jul 6, 2016, at 9:46 AM, Dan Burkert <d...@cloudera.com 
>> <mailto:d...@cloudera.com>> wrote:
>> 
>> 
>> 
>> On Wed, Jul 6, 2016 at 7:05 AM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Over the weekend, the row count is up to <500M. I will give it another few 
>> days to get to 1B rows. I still get consistent times ~15s for doing row 
>> counts despite the amount of data growing.
>> 
>> On another note, I got a solicitation email from SnappyData to evaluate 
>> their product. They claim to be the “Spark Data Store” with tight 
>> integration with Spark executors. It claims to be an OLTP and OLAP system 
>> that is an in-memory data store first and spills to disk afterward. After going to 
>> several Spark events, it would seem that this is the new “hot” area for 
>> vendors. They all (MemSQL, Redis, Aerospike, Datastax, etc.) claim to be the 
>> best "Spark Data Store”. I’m wondering if Kudu will become this too? With 
>> the performance I’ve seen so far, it would seem that it can be a contender. 
>> All that is needed is a hardened Spark connector package, I would think. The 
>> next evaluation I will be conducting is to see if SnappyData’s claims are 
>> valid by doing my own tests.
>> 
>> It's hard to compare Kudu against any other data store without a lot of 
>> analysis and thorough benchmarking, 

Re: [ANNOUNCE] Apache Zeppelin 0.6.0 released

2016-07-09 Thread Benjamin Kim
Hi JL,

Spark is version 1.6.0 and Akka is 2.2.3. But Cloudera always backports 
things from newer versions. They told me that they ported some bug fixes from 
Spark 2.0.

Please let me know if you need any more information.

Cheers,
Ben


> On Jul 9, 2016, at 10:12 PM, Jongyoul Lee <jongy...@gmail.com> wrote:
> 
> Hi all,
> 
> Could you guys check the CDH version of Spark? As I tested it a long time 
> ago, it is a little bit different from the vanilla one; for example, the 
> CDH one has different versions of some dependencies, including Akka.
> 
> Regards,
> JL
> 
> On Sat, Jul 9, 2016 at 11:47 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Felix,
> 
> I added hive-site.xml to the conf directory and restarted Zeppelin. Now, I 
> get another error:
> 
> java.lang.ClassNotFoundException: 
> line1631424043$24.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
>   at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
>   at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.Object

Re: [ANNOUNCE] Apache Zeppelin 0.6.0 released

2016-07-09 Thread Benjamin Kim
(ObjectInputStream.java:1924)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:64)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Thanks for helping.

Ben


> On Jul 8, 2016, at 10:47 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:
> 
> For #1, do you know if Spark can find the Hive metastore config (typically in 
> hive-site.xml) - Spark's log should indicate that.
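
A quick check from a %spark paragraph can tell whether hive-site.xml is visible
to the interpreter and whether the metastore answers. A minimal sketch, assuming
sqlContext is the Hive-aware context Zeppelin creates when Hive support is
available:

  %spark
  // Is hive-site.xml on the interpreter's classpath at all?
  val hiveSite = Option(getClass.getClassLoader.getResource("hive-site.xml"))
  println("hive-site.xml: " + hiveSite.getOrElse("NOT FOUND"))

  // If the metastore is reachable, this lists its databases rather than just "default".
  sqlContext.sql("SHOW DATABASES").show()
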
> 
> 
> _________
> From: Benjamin Kim <bbuil...@gmail.com <mailto:bbuil...@gmail.com>>
> Sent: Friday, July 8, 2016 6:53 AM
> Subject: Re: [ANNOUNCE] Apache Zeppelin 0.6.0 released
> To: <users@zeppelin.apache.org <mailto:users@zeppelin.apache.org>>
> Cc: <d...@zeppelin.apache.org <mailto:d...@zeppelin.apache.org>>
> 
> 
> Felix,
> 
> I forgot to add that I built Zeppelin from source 
> http://mirrors.ibiblio.org/apache/zeppelin/zeppelin-0.6.0/zeppelin-0.6.0.tgz 
> <http://mirrors.ibiblio.org/apache/zeppelin/zeppelin-0.6.0/zeppelin-0.6.0.tgz>
>  using this command "mvn clean package -DskipTests -Pspark-1.6 -Phadoop-2.6 
> -Dspark.version=1.6.0-cdh5.7.1 -Dhadoop.version=2.6.0-cdh5.7.1 -Ppyspark 
> -Pvendor-repo -Pbuild-distr -Dhbase.hbase.version=1.2.0-cdh5.7.1 
> -Dhbase.hadoop.version=2.6.0-cdh5.7.1".
> 
> I did this because we are using HBase 1.2 within CDH 5.7.1.
> 
> Hope this helps clarify.
> 
> Thanks,
> Ben
> 
> 
> 
> On Jul 8, 2016, at 2:01 AM, Felix Cheung <felixcheun...@hotmail.com 
> <mailto:felixcheun...@hotmail.com>> wrote:
> 
> Is this possibly caused by CDH requiring a build-from-source instead of the 
> official binary releases?
> 
> 
> 
> 
> 
> On Thu, Jul 7, 2016 at 8:22 PM -0700, "Benjamin Kim" <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> 
> Moon,
> 
> My environmental setup consists of an 18 node CentOS 6.7 cluster with 24 
> cores, 64GB, 12TB storage each:
> 3 of those nodes are used as Zookeeper servers, HDFS name nodes, and a YARN 
> resource manager
> 15 are for data nodes
> jdk1.8_60 and CDH 5.7.1 installed
> 
> Another node is an app server, 24 cores, 128GB memory, 1TB storage. It has 
> Zeppelin 0.6.0 and Livy 0.2.0 running on it. Plus, Hive Metastore and 
> HiveServer2, Hue, and Oozie are running on it from CDH 5.7.1.
> 
> This is our QA cluster where we are testing before deploying to production.
> 
> If you need more information, please let me know.
> 
> Thanks,
> Ben
> 
>  
> 
> On Jul 7, 2016, at 7:54 PM, moon soo Lee <m...@apache.org 
> <mailto:m...@apache.org>> 

Re: Performance Question

2016-07-08 Thread Benjamin Kim
Dan,

This is good to hear, as we are heavily invested in Spark, as are many of our 
competitors in the AdTech/Telecom world. It would be nice to have Kudu on par 
with the other data store technologies in terms of Spark usability, so as not 
to choose one based on “who provides it now in production”, as management 
tends to say.

Cheers,
Ben

> On Jul 6, 2016, at 9:46 AM, Dan Burkert <d...@cloudera.com> wrote:
> 
> 
> 
> On Wed, Jul 6, 2016 at 7:05 AM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Over the weekend, the row count is up to <500M. I will give it another few 
> days to get to 1B rows. I still get consistent times ~15s for doing row 
> counts despite the amount of data growing.
> 
> On another note, I got a solicitation email from SnappyData to evaluate their 
> product. They claim to be the “Spark Data Store” with tight integration with 
> Spark executors. It claims to be an OLTP and OLAP system that is an 
> in-memory data store first and spills to disk afterward. After going to several Spark events, 
> it would seem that this is the new “hot” area for vendors. They all (MemSQL, 
> Redis, Aerospike, Datastax, etc.) claim to be the best "Spark Data Store”. 
> I’m wondering if Kudu will become this too? With the performance I’ve seen so 
> far, it would seem that it can be a contender. All that is needed is a 
> hardened Spark connector package, I would think. The next evaluation I will 
> be conducting is to see if SnappyData’s claims are valid by doing my own 
> tests.
> 
> It's hard to compare Kudu against any other data store without a lot of 
> analysis and thorough benchmarking, but it is certainly a goal of Kudu to be 
> a great platform for ingesting and analyzing data through Spark.  Up till 
> this point most of the Spark work has been community driven, but more 
> thorough integration testing of the Spark connector is going to be a focus 
> going forward.
> 
> - Dan
> 
>  
> Cheers,
> Ben
> 
> 
> 
>> On Jun 15, 2016, at 12:47 AM, Todd Lipcon <t...@cloudera.com 
>> <mailto:t...@cloudera.com>> wrote:
>> 
>> Hi Benjamin,
>> 
>> What workload are you using for benchmarks? Using spark or something more 
>> custom? rdd or data frame or SQL, etc? Maybe you can share the schema and 
>> some queries
>> 
>> Todd
>> 
>> Todd
>> 
>> On Jun 15, 2016 8:10 AM, "Benjamin Kim" <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Hi Todd,
>> 
>> Now that Kudu 0.9.0 is out, I have done some tests. Already, I am impressed. 
>> Compared to HBase, read and write performance are better. Write performance 
>> has the greatest improvement (> 4x), while read is > 1.5x. Granted, these are 
>> only preliminary tests. Do you know of a way to really do some conclusive 
>> tests? I want to see if I can match your results on my 50 node cluster.
>> 
>> Thanks,
>> Ben
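
One way to make such comparisons more repeatable is to time the same
full-table count through both the DataFrame API and Spark SQL, which also pins
down the rdd/data frame/SQL question. A rough sketch for a spark-shell or
Zeppelin session where sqlContext is already available, assuming the 0.9.x
kudu-spark data source and placeholder master and table names:

  // Time a block and print the elapsed seconds.
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(f"$label%-10s ${(System.nanoTime() - start) / 1e9}%.1f s")
    result
  }

  val df = sqlContext.read
    .format("org.kududb.spark.kudu")   // 0.9.x connector package, as in the stack traces above
    .options(Map("kudu.master" -> "kudu-master.example.com:7051",
                 "kudu.table"  -> "my_table"))
    .load()

  df.registerTempTable("my_table")     // Spark 1.6 API

  time("dataframe") { df.count() }
  time("sql") { sqlContext.sql("SELECT COUNT(*) FROM my_table").first().getLong(0) }
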
>> 
>>> On May 30, 2016, at 10:33 AM, Todd Lipcon <t...@cloudera.com 
>>> <mailto:t...@cloudera.com>> wrote:
>>> 
>>> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> Todd,
>>> 
>>> It sounds like Kudu can possibly top or match those numbers put out by 
>>> Aerospike. Do you have any performance statistics published, or any 
>>> instructions on how to measure them myself as a good way to test? In addition, 
>>> this will be a test using Spark, so should I wait for Kudu version 0.9.0 
>>> where support will be built in?
>>> 
>>> We don't have a lot of benchmarks published yet, especially on the write 
>>> side. I've found that thorough cross-system benchmarks are very difficult 
>>> to do fairly and accurately, and often times users end up misguided if they 
>>> pay too much attention to them :) So, given a finite number of developers 
>>> working on Kudu, I think we've tended to spend more time on the project 
>>> itself and less time focusing on "competition". I'm sure there are use 
>>> cases where Kudu will beat out Aerospike, and probably use cases where 
>>> Aerospike will beat Kudu as well.
>>> 
>>> From my perspective, it would be great if you can share some details of 
>>> your workload, especially if there are some areas you're finding Kudu 
>>> lacking. Maybe we can spot some easy code changes we could make to improve 
>>> performance, or suggest a tuning variable you could change.
>>> 
>>> -Todd
>>> 
>>> 
>>>> On May 27

Re: [ANNOUNCE] Apache Zeppelin 0.6.0 released

2016-07-08 Thread Benjamin Kim
Felix,

I forgot to add that I built Zeppelin from source 
http://mirrors.ibiblio.org/apache/zeppelin/zeppelin-0.6.0/zeppelin-0.6.0.tgz 
<http://mirrors.ibiblio.org/apache/zeppelin/zeppelin-0.6.0/zeppelin-0.6.0.tgz> 
using this command "mvn clean package -DskipTests -Pspark-1.6 -Phadoop-2.6 
-Dspark.version=1.6.0-cdh5.7.1 -Dhadoop.version=2.6.0-cdh5.7.1 -Ppyspark 
-Pvendor-repo -Pbuild-distr -Dhbase.hbase.version=1.2.0-cdh5.7.1 
-Dhbase.hadoop.version=2.6.0-cdh5.7.1".

I did this because we are using HBase 1.2 within CDH 5.7.1.

Hope this helps clarify.

Thanks,
Ben



> On Jul 8, 2016, at 2:01 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
> 
> Is this possibly caused by CDH requiring a build-from-source instead of the 
> official binary releases?
> 
> 
> 
> 
> 
> On Thu, Jul 7, 2016 at 8:22 PM -0700, "Benjamin Kim" <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> 
> Moon,
> 
> My environmental setup consists of an 18 node CentOS 6.7 cluster with 24 
> cores, 64GB, 12TB storage each:
> 3 of those nodes are used as Zookeeper servers, HDFS name nodes, and a YARN 
> resource manager
> 15 are for data nodes
> jdk1.8_60 and CDH 5.7.1 installed
> 
> Another node is an app server, 24 cores, 128GB memory, 1TB storage. It has 
> Zeppelin 0.6.0 and Livy 0.2.0 running on it. Plus, Hive Metastore and 
> HiveServer2, Hue, and Oozie are running on it from CDH 5.7.1.
> 
> This is our QA cluster where we are testing before deploying to production.
> 
> If you need more information, please let me know.
> 
> Thanks,
> Ben
> 
>  
> 
>> On Jul 7, 2016, at 7:54 PM, moon soo Lee <m...@apache.org 
>> <mailto:m...@apache.org>> wrote:
>> 
>> Randy,
>> 
>> Helium is not included in the 0.6.0 release. Could you check which version 
>> you are using?
>> I created a fix for 500 errors from Helium URL in master branch. 
>> https://github.com/apache/zeppelin/pull/1150 
>> <https://github.com/apache/zeppelin/pull/1150>
>> 
>> Ben,
>> I can not reproduce the error, could you share how to reproduce error, or 
>> share your environment?
>> 
>> Thanks,
>> moon
>> 
>> On Thu, Jul 7, 2016 at 4:02 PM Randy Gelhausen <rgel...@gmail.com 
>> <mailto:rgel...@gmail.com>> wrote:
>> I don't, but I hoped providing that information might help with finding & 
>> fixing the problem.
>> 
>> On Thu, Jul 7, 2016 at 5:53 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Hi Randy,
>> 
>> Do you know of any way to fix it or know of a workaround?
>> 
>> Thanks,
>> Ben
>> 
>>> On Jul 7, 2016, at 2:08 PM, Randy Gelhausen <rgel...@gmail.com 
>>> <mailto:rgel...@gmail.com>> wrote:
>>> 
>>> HTTP 500 errors from a Helium URL
>> 
>> 
> 


