Re: Spark 2.1.1 (Scala 2.11.8) write to Phoenix 4.7 (HBase 1.1.2)

2018-01-31 Thread Josh Mahonin
Hi,

As per https://phoenix.apache.org/phoenix_spark.html, Apache Phoenix is
compiled against Spark2 only in versions 4.10 and above. If you must use
Phoenix 4.7 against Spark 2.x, you may need to apply PHOENIX- yourself:

https://github.com/apache/phoenix/commit/a0e5efcec5a1a732b2dce9794251242c3d66eea6#diff-600376dffeb79835ede4a0b285078036

Note that if you're using a vendor-provided Phoenix distribution, there may
already be support for the Spark version you're using. Please follow up
with them if that's the case.
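
For reference, once you're on a Spark2-compatible Phoenix (4.10+), the
documented DataFrame read/write path looks roughly like the sketch below.
The table names and ZooKeeper quorum are placeholders, and the fat Phoenix
client JAR still needs to be on the Spark driver and executor classpath:

import org.apache.spark.sql.SaveMode

// spark is the SparkSession (e.g. in spark-shell)
val df = spark.read
  .format("org.apache.phoenix.spark")
  .option("table", "INPUT_TABLE")
  .option("zkUrl", "zkhost:2181")
  .load()

// Write to another (pre-created) Phoenix table
df.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)
  .option("table", "OUTPUT_TABLE")
  .option("zkUrl", "zkhost:2181")
  .save()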

Best,

Josh




On Tue, Jan 30, 2018 at 10:06 AM, Margusja  wrote:

> Also I see that Logging has been moved to internal/Logging. But is there a
> package for my environment I can use?
>
> Margus
>
>
> On 30 Jan 2018, at 17:00, Margusja  wrote:
>
> Hi
>
> Followed the page (https://phoenix.apache.org/phoenix_spark.html) and am trying to save to Phoenix.
>
> Using spark-1.6.3 it is successful, but using spark-2.1.1 it is not.
> The first error I am getting with spark-2.1.1 is:
>
> Error:scalac: missing or invalid dependency detected while loading class
> file 'ProductRDDFunctions.class'.
> Could not access type Logging in package org.apache.spark,
> because it (or its dependencies) are missing. Check your build definition
> for
> missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see
> the problematic classpath.)
> A full rebuild may help if 'ProductRDDFunctions.class' was compiled
> against an incompatible version of org.apache.spark.
>
> I can see that Logging is removed after 1.6.3 and does not exist in 2.1.1.
>
> What are my options?
>
> Br
> Margus
>
>
>


Re: [ANNOUNCE] New PMC Member: Sergey Soldatov

2017-09-25 Thread Josh Mahonin
Congratulations Sergey!

On Sun, Sep 24, 2017 at 4:05 PM, Ted Yu  wrote:

> Congratulations, Sergey !
>
> On Sun, Sep 24, 2017 at 1:00 PM, Josh Elser  wrote:
>
>> All,
>>
>> The Apache Phoenix PMC has recently voted to extend an invitation to
>> Sergey to join the PMC in recognition of his continued contributions to the
>> community. We are happy to share that he has accepted this offer.
>>
>> Please join me in congratulating Sergey! Congratulations on a
>> well-deserved invitation.
>>
>> - Josh (on behalf of the entire PMC)
>>
>
>


Re: Use Phoenix hints with Spark Integration [main use case: block cache disable]

2017-08-31 Thread Josh Mahonin
Hi Roberto,

At present, I don't believe there's any way to pass a query hint
explicitly, as the SELECT statement is built based on the table name and
columns, down in this method:

https://github.com/apache/phoenix/blob/892be13985658169ae581b3cb318845891f36b92/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixInputFormat.java#L176

However, it does seem that the Hive integration has this built in, but it
doesn't exist in the rest of the Phoenix MR codebase:

https://github.com/apache/phoenix/blob/616cd057d3c7d587aafe278948f8cff84efc9d29/phoenix-hive/src/main/java/org/apache/phoenix/hive/query/PhoenixQueryBuilder.java#L220-L235

Would you mind filing a JIRA ticket? As always, patches are welcome as
well. I suspect we should be disabling the block cache for phoenix-spark by
default as Hive does.
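
As a possible interim workaround (bypassing the connector for that
particular read), the hinted SELECT can be issued over plain Phoenix JDBC
from within Spark. A rough, untested sketch, with the table, columns and
zkUrl as placeholders:

import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:phoenix:zkhost:2181")
val rs = conn.createStatement().executeQuery(
  "SELECT /*+ NO_CACHE */ ID, COL1 FROM MY_TABLE WHERE COL1 = 'foo'")
while (rs.next()) {
  println(rs.getString("ID"))
}
conn.close()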

Thanks!

Josh


On Wed, Aug 30, 2017 at 7:11 AM, Roberto Coluccio 
wrote:

> Hello folks,
>
> I'm trying to prevent the records I select from my Spark application from
> being added to the block cache when reading them as a DataFrame (e.g.
> sqlContext.phoenixTableAsDataFrame(myTable, myColumns, myPredicate,
> myZkUrl, myConf)).
>
> I know I can force no caching on a per-query basis when issuing SQL queries
> by leveraging the /*+ NO_CACHE */ hint.
> I know I can disable the caching on a table-specific or column-family-
> specific basis through an ALTER TABLE HBase shell command.
>
> What I don't know is how to do so when leveraging Phoenix-Spark APIs. I
> think my problem can be stated as a more general purpose question:
>
> *how can Phoenix hints be specified when using Phoenix-Spark APIs? *For
> my specific use case, I tried to push the property *hfile.block.cache.size=0*
> within a Configuration object before creating the DataFrame, but I
> realized records resulting from the underlying scan were still cached.
>
> Thank you in advance for your help.
>
> Best regards,
> Roberto
>


Re: phoenix spark options not supporint query in dbtable

2017-08-17 Thread Josh Mahonin
You're mostly at the mercy of HBase and Phoenix to ensure that your data is
evenly distributed in the underlying regions. You could look at
pre-splitting or salting [1] your tables, as well as adjusting the
guidepost parameters [2] if you need finer tuned control.

If you end up with more idle Spark workers than RDD partitions, a pattern
I've seen is to simply repartition() the RDD / DataFrame after it's loaded
to a higher level of parallelism. You pay some overhead cost to
redistribute the data between executors, but you may make it up by having
more workers processing the data.
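
For example, a minimal sketch of that pattern (assuming an existing
SQLContext; the table, columns and zkUrl are placeholders):

import org.apache.phoenix.spark._

val df = sqlContext.phoenixTableAsDataFrame(
  "MY_TABLE", Seq("ID", "COL1"), zkUrl = Some("zkhost:2181"))

// One Spark partition per Phoenix scan by default; raise it if executors sit idle
val repartitioned = df.repartition(64)
println(s"partitions after repartition: ${repartitioned.rdd.partitions.size}")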

Josh

[1] https://phoenix.apache.org/salted.html
[2] https://phoenix.apache.org/tuning_guide.html

On Thu, Aug 17, 2017 at 2:36 PM, Kanagha <er.kana...@gmail.com> wrote:

> Thanks for the details.
>
> I tested it out and saw that the number of partitions equals the number of
> parallel scans run upon DataFrame load in Phoenix 4.10.
> Also, how can we ensure that the read gets evenly distributed as tasks
> across the number of executors set for the job? I'm running the
> phoenixTableAsDataFrame API on a table with 4-way parallel scans and with
> 4 executors set for the job. Thanks for the inputs.
>
>
> Kanagha
>
> On Thu, Aug 17, 2017 at 7:17 AM, Josh Mahonin <jmaho...@gmail.com> wrote:
>
>> Hi,
>>
>> Phoenix is able to parallelize queries based on the underlying HBase
>> region splits, as well as its own internal guideposts based on statistics
>> collection [1]
>>
>> The phoenix-spark connector exposes those splits to Spark for the RDD /
>> DataFrame parallelism. In order to test this out, you can try running an
>> EXPLAIN SELECT... query [2] to mimic the DataFrame load to see how many
>> parallel scans will be run, and then compare those to the RDD / DataFrame
>> partition count (some_rdd.partitions.size). In Phoenix 4.10 and above [3],
>> they will be the same. In versions below that, the partition count will
>> equal the number of regions for that table.
>>
>> Josh
>>
>> [1] https://phoenix.apache.org/update_statistics.html
>> [2] https://phoenix.apache.org/tuning_guide.html
>> [3] https://issues.apache.org/jira/browse/PHOENIX-3600
>>
>>
>> On Thu, Aug 17, 2017 at 3:07 AM, Kanagha <er.kana...@gmail.com> wrote:
>>
>>> Also, I'm using phoenixTableAsDataFrame API to read from a pre-split
>>> phoenix table. How can we ensure read is parallelized across all executors?
>>> Would salting/pre-splitting tables help in providing parallelism?
>>> Appreciate any inputs.
>>>
>>> Thanks
>>>
>>>
>>> Kanagha
>>>
>>> On Wed, Aug 16, 2017 at 10:16 PM, kanagha <er.kana...@gmail.com> wrote:
>>>
>>>> Hi Josh,
>>>>
>>>> Per your previous post, it is mentioned "The phoenix-spark parallelism
>>>> is
>>>> based on the splits provided by the Phoenix query planner, and has no
>>>> requirements on specifying partition columns or upper/lower bounds."
>>>>
>>>> Does it depend upon the region splits on the input table for
>>>> parallelism?
>>>> Could you please provide more details?
>>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context: http://apache-phoenix-user-lis
>>>> t.1124778.n5.nabble.com/phoenix-spark-options-not-supporint-
>>>> query-in-dbtable-tp1915p3810.html
>>>> Sent from the Apache Phoenix User List mailing list archive at
>>>> Nabble.com.
>>>>
>>>
>>>
>>
>


Re: Custom Connector for Prestodb

2017-08-17 Thread Josh Mahonin
Hi Luqman,

I just responded to another query on the list about phoenix-spark that may
help shed some light. In addition, the preferred locations the
phoenix-spark connector exposes are determined in the general
PhoenixInputFormat MapReduce code [1]

I'm not very familiar with PrestoDB, but if it's able to load data using a
general Hadoop InputFormat, the PhoenixInputFormat would be a good place to
start looking.

Josh

[1]
https://github.com/apache/phoenix/blob/5b099014446865c12779f3882fd8b407496717ea/phoenix-hive/src/main/java/org/apache/phoenix/hive/mapreduce/PhoenixInputFormat.java#L177-L178



On Thu, Aug 17, 2017 at 5:46 AM, Luqman Ghani  wrote:

> Hi,
>
> We are evaluating the possibility of writing a custom connector for
> Phoenix to access tables stored in HBase. However, we need some help.
>
> The connector for Presto should be able to read from HBase cluster using
> parallel collections. For that the connector has a "ConnectorSplitManager"
> which needs to be implemented. To quote from here:
> "
> The split manager partitions the data for a table into the individual
> chunks that Presto will distribute to workers for processing. For example,
> the Hive connector lists the files for each Hive partition and creates one
> or more split per file. For data sources that don’t have partitioned data,
> a good strategy here is to simply return a single split for the entire
> table. This is the strategy employed by the Example HTTP connector.
> "
>
> I want to know if there's a way to implement Split Manager so that the
> data in HBase can be accessed by parallel connections. I was trying to
> follow the code for the Phoenix-Spark connector to
> see how it decides getPreferredLocations to create splits, but couldn't
> understand.
>
> Any hints or code directions will be helpful.
>
> Regards,
> Luqman
>


Re: phoenix spark options not supporint query in dbtable

2017-08-17 Thread Josh Mahonin
Hi,

Phoenix is able to parallelize queries based on the underlying HBase region
splits, as well as its own internal guideposts based on statistics
collection [1]

The phoenix-spark connector exposes those splits to Spark for the RDD /
DataFrame parallelism. In order to test this out, you can try running an
EXPLAIN SELECT... query [2] to mimic the DataFrame load to see how many
parallel scans will be run, and then compare those to the RDD / DataFrame
partition count (some_rdd.partitions.size). In Phoenix 4.10 and above [3],
they will be the same. In versions below that, the partition count will
equal the number of regions for that table.
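
For example, a rough sketch of that comparison (assuming an existing
SQLContext; the table, columns and zkUrl are placeholders):

import java.sql.DriverManager
import org.apache.phoenix.spark._

// 1. Inspect the query plan that the DataFrame load will mimic
val conn = DriverManager.getConnection("jdbc:phoenix:zkhost:2181")
val rs = conn.createStatement().executeQuery("EXPLAIN SELECT ID, COL1 FROM MY_TABLE")
while (rs.next()) println(rs.getString(1))
conn.close()

// 2. Compare with the partition count of the loaded DataFrame
val df = sqlContext.phoenixTableAsDataFrame(
  "MY_TABLE", Seq("ID", "COL1"), zkUrl = Some("zkhost:2181"))
println(s"RDD partitions: ${df.rdd.partitions.size}")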

Josh

[1] https://phoenix.apache.org/update_statistics.html
[2] https://phoenix.apache.org/tuning_guide.html
[3] https://issues.apache.org/jira/browse/PHOENIX-3600


On Thu, Aug 17, 2017 at 3:07 AM, Kanagha  wrote:

> Also, I'm using phoenixTableAsDataFrame API to read from a pre-split
> phoenix table. How can we ensure read is parallelized across all executors?
> Would salting/pre-splitting tables help in providing parallelism?
> Appreciate any inputs.
>
> Thanks
>
>
> Kanagha
>
> On Wed, Aug 16, 2017 at 10:16 PM, kanagha  wrote:
>
>> Hi Josh,
>>
>> Per your previous post, it is mentioned "The phoenix-spark parallelism is
>> based on the splits provided by the Phoenix query planner, and has no
>> requirements on specifying partition columns or upper/lower bounds."
>>
>> Does it depend upon the region splits on the input table for parallelism?
>> Could you please provide more details?
>>
>>
>> Thanks
>>
>>
>>
>> --
>> View this message in context: http://apache-phoenix-user-lis
>> t.1124778.n5.nabble.com/phoenix-spark-options-not-supporint-
>> query-in-dbtable-tp1915p3810.html
>> Sent from the Apache Phoenix User List mailing list archive at Nabble.com.
>>
>
>


Re: Apache Spark Integration

2017-07-19 Thread Josh Mahonin
Hi Luqman,

At present, the phoenix-spark integration relies on the schema having been
already created.

There has been some discussion of augmenting the supported Spark
'SaveMode's to include 'CREATE IF NOT EXISTS' logic.

https://issues.apache.org/jira/browse/PHOENIX-2745
https://issues.apache.org/jira/browse/PHOENIX-2632
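
In the meantime, the target table has to be created up front, e.g. over
Phoenix JDBC, before calling save. A rough, untested sketch, where the
table, columns and zkUrl are placeholders and df is an existing DataFrame
with matching columns:

import java.sql.DriverManager
import org.apache.phoenix.spark._

val conn = DriverManager.getConnection("jdbc:phoenix:zkhost:2181")
conn.createStatement().execute(
  "CREATE TABLE IF NOT EXISTS OUTPUT_TABLE (ID BIGINT NOT NULL PRIMARY KEY, COL1 VARCHAR)")
conn.close()

df.saveToPhoenix("OUTPUT_TABLE", zkUrl = Some("zkhost:2181"))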

Contributions would be most welcome!

Josh


On Tue, Jul 18, 2017 at 6:50 AM, Luqman Ghani  wrote:

> Hi,
>
> I was wondering if the phoenix-spark connector creates a new table if one
> doesn't exist. Or do I have to create a table before calling the
> saveToPhoenix function on a DataFrame? It is not evident from the
> tests link provided by Ankit above.
>
> Thanks,
> Luqman
>
> On Mon, Jul 17, 2017 at 11:23 PM, Luqman Ghani  wrote:
>
>> Thanks Ankit. I am sure this will help.
>>
>> On Mon, Jul 17, 2017 at 11:20 PM, Ankit Singhal wrote:
>>
>>> You can take a look at our IT tests for phoenix-spark module.
>>> https://github.com/apache/phoenix/blob/master/phoenix-spark/
>>> src/it/scala/org/apache/phoenix/spark/PhoenixSparkIT.scala
>>>
>>> On Mon, Jul 17, 2017 at 9:20 PM, Luqman Ghani  wrote:
>>>

 -- Forwarded message --
 From: Luqman Ghani 
 Date: Sat, Jul 15, 2017 at 2:38 PM
 Subject: Apache Spark Integration
 To: user@phoenix.apache.org


 Hi,

 I am evaluating which approach to use for integrating Phoenix with
 Spark, namely JDBC and phoenix-spark. I have one query regarding the
 following point stated in the limitations of the Apache Spark Integration
 section:
 "
 - The Data Source API does not support passing custom Phoenix
 settings in configuration, you must create the DataFrame or RDD directly if
 you need fine-grained configuration.
 "

 Can someone point me to or give an example on how to give such
 configuration?

 Also, it says in the docs that
 there is a 'save' function to save a DataFrame to a table. But there is
 none. Instead, 'saveToPhoenix' shows up in my IntelliJ IDE suggestions. I'm
 using phoenix-4.11.0-HBase-1.2 and Spark 2.0.2. Is this an error in the docs?

 Thanks,
 Luqman


>>>
>>
>


Re: Exception ERROR 201 (22000): Illegal data. Expected length of at least 8 bytes, but had 4

2017-07-06 Thread Josh Mahonin
Hi Takashi,

Thanks for the update. Do you think it could be the same issue as
https://issues.apache.org/jira/browse/PHOENIX-3453 ?

If not, it would be great if you could file a new JIRA ticket with as much
detail as possible, and ideally a simple way to reproduce it.

Thanks!

Josh

On Thu, Jul 6, 2017 at 6:36 AM, Takashi Sasaki <tsasaki...@gmail.com> wrote:

> Hi Josh,
>
>
> Well, it is difficult to do soon. I'll execute the query directly if I have time.
>
> Information on the table is difficult to post due to a confidentiality
> agreement, but I will negotiate with my boss and provide it if possible.
>
> By the way, after changing all local indexes to global indexes, the
> exception no longer occurs.
>
> It seems there is a problem with the local index.
>
>
> Thanks,
>
> Takashi
>
>
> 2017-07-05 22:35 GMT+09:00 Josh Mahonin <jmaho...@gmail.com>:
> > Hi,
> >
> > From the logs you attached, it appears that you're getting the exception
> on
> > the following query:
> >
> > SELECT trid, tid, frtp, frno, gzid, ontm, onty, onlt, onln,
> > oftm, ofty, oflt, ofln, onwk, onhr, wday, dist, drtn, delf, sntf,
> > cdtm, udtm FROM  trp WHERE  tid = ? AND delf = FALSE ORDER BY oftm
> > DESC NULLS LAST FETCH FIRST 1 ROW ONLY
> >
> > Are you able to reproduce this issue by executing that query directly?
> Given
> > a snippet of your stack trace, I'm not sure if Spark is the culprit
> here, so
> > it'd be nice to try identify the root cause (or maybe correlate it to an
> > existing JIRA ticket)
> >
> > ...
> > Caused by: java.sql.SQLException: ERROR 201 (22000): Illegal data.
> > ERROR 201 (22000): ERROR 201 (22000): Illegal data. Expected length of
> > at least 8 bytes, but had 4
> > TRP,\x0B\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
> x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
> x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
> x00\x00\x00\x00\x00,1499228698442.4f312a19bfdc624776516228c9c4f2d5.
> >  at
> > org.apache.phoenix.exception.SQLExceptionCode$Factory$1.
> newException(SQLExceptionCode.java:464)
> >  at
> > org.apache.phoenix.exception.SQLExceptionInfo.buildException(
> SQLExceptionInfo.java:150)
> >  at
> > org.apache.phoenix.util.ServerUtil.parseRemoteException(
> ServerUtil.java:134)
> >  at
> > org.apache.phoenix.util.ServerUtil.parseServerExceptionOrNull(
> ServerUtil.java:123)
> >  at
> > org.apache.phoenix.util.ServerUtil.parseServerException(
> ServerUtil.java:109)
> >  at
> > org.apache.phoenix.iterate.BaseResultIterators.getIterators(
> BaseResultIterators.java:755)
> >  at
> > org.apache.phoenix.iterate.BaseResultIterators.getIterators(
> BaseResultIterators.java:699)
> >  at
> > org.apache.phoenix.iterate.MergeSortResultIterator.getMinHeap(
> MergeSortResultIterator.java:72)
> >  at
> > org.apache.phoenix.iterate.MergeSortResultIterator.minIterator(
> MergeSortResultIterator.java:93)
> >  at
> > org.apache.phoenix.iterate.MergeSortResultIterator.next(
> MergeSortResultIterator.java:58)
> >  at
> > org.apache.phoenix.iterate.MergeSortTopNResultIterator.next(
> MergeSortTopNResultIterator.java:95)
> >  at org.apache.phoenix.jdbc.PhoenixResultSet.next(
> PhoenixResultSet.java:778)
> > ...
> >
> > Also, are you able to post your CREATE TABLE DDL for the 'trp' table
> you're
> > querying?
> >
> > Thanks,
> >
> > Josh
> >
> > On Wed, Jul 5, 2017 at 1:24 AM, Takashi Sasaki <tsasaki...@gmail.com>
> wrote:
> >>
> >> Hi,
> >>
> >> I'm using Phoenix4.9.0/HBase1.3.0 with Spark2.1.1 on AWS EMR 5.6.0.
> >>
> >> When one side uses Spark Streaming to write a lot of data and the
> >> other side uses Spark to read data,
> >> I encounter several similar exceptions.
> >>
> >> 17/07/05 04:33:47 ERROR Executor: Exception in task 83.0 in stage
> >> 166.0 (TID 95211)
> >> org.apache.ibatis.exceptions.PersistenceException:
> >> ### Error querying database.  Cause: java.sql.SQLException: ERROR 201
> >> (22000): Illegal data. ERROR 201 (22000): ERROR 201 (22000): Illegal
> >> data. Expected length of at least 8 bytes, but had 4
> >>
> >> TRP,\x0B\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
> x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
> x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
> x00\x00\x00\x00\x00,1499228698442.4f312a19bfdc624776516228c9c4f2d5.
> >> ### The error may exist in agp/mapper/TripMapper.java (best guess)
> >> ### The error may involve agp.mapper.

Re: Exception ERROR 201 (22000): Illegal data. Expected length of at least 8 bytes, but had 4

2017-07-05 Thread Josh Mahonin
Hi,

From the logs you attached, it appears that you're getting the exception on
the following query:

SELECT trid, tid, frtp, frno, gzid, ontm, onty, onlt, onln,
oftm, ofty, oflt, ofln, onwk, onhr, wday, dist, drtn, delf, sntf,
cdtm, udtm FROM  trp WHERE  tid = ? AND delf = FALSE ORDER BY oftm
DESC NULLS LAST FETCH FIRST 1 ROW ONLY

Are you able to reproduce this issue by executing that query directly?
Given a snippet of your stack trace, I'm not sure if Spark is the culprit
here, so it'd be nice to try to identify the root cause (or maybe correlate it
to an existing JIRA ticket)

...
Caused by: java.sql.SQLException: ERROR 201 (22000): Illegal data.
ERROR 201 (22000): ERROR 201 (22000): Illegal data. Expected length of
at least 8 bytes, but had 4
TRP,\x0B\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,
1499228698442.4f312a19bfdc624776516228c9c4f2d5.
 at org.apache.phoenix.exception.SQLExceptionCode$Factory$1.
newException(SQLExceptionCode.java:464)
 at org.apache.phoenix.exception.SQLExceptionInfo.buildException(
SQLExceptionInfo.java:150)
 at org.apache.phoenix.util.ServerUtil.parseRemoteException(
ServerUtil.java:134)
 at org.apache.phoenix.util.ServerUtil.parseServerExceptionOrNull(
ServerUtil.java:123)
 at org.apache.phoenix.util.ServerUtil.parseServerException(
ServerUtil.java:109)
 at org.apache.phoenix.iterate.BaseResultIterators.getIterators(
BaseResultIterators.java:755)
 at org.apache.phoenix.iterate.BaseResultIterators.getIterators(
BaseResultIterators.java:699)
 at org.apache.phoenix.iterate.MergeSortResultIterator.getMinHeap(
MergeSortResultIterator.java:72)
 at org.apache.phoenix.iterate.MergeSortResultIterator.minIterator(
MergeSortResultIterator.java:93)
 at org.apache.phoenix.iterate.MergeSortResultIterator.next(
MergeSortResultIterator.java:58)
 at org.apache.phoenix.iterate.MergeSortTopNResultIterator.next(
MergeSortTopNResultIterator.java:95)
 at org.apache.phoenix.jdbc.PhoenixResultSet.next(PhoenixResultSet.java:778)
...

Also, are you able to post your CREATE TABLE DDL for the 'trp' table you're
querying?

Thanks,

Josh

On Wed, Jul 5, 2017 at 1:24 AM, Takashi Sasaki  wrote:

> Hi,
>
> I'm using Phoenix4.9.0/HBase1.3.0 with Spark2.1.1 on AWS EMR 5.6.0.
>
> When one side uses Spark Streaming to write a lot of data and the
> other side uses Spark to read data,
> I encounter several similar exceptions.
>
> 17/07/05 04:33:47 ERROR Executor: Exception in task 83.0 in stage
> 166.0 (TID 95211)
> org.apache.ibatis.exceptions.PersistenceException:
> ### Error querying database.  Cause: java.sql.SQLException: ERROR 201
> (22000): Illegal data. ERROR 201 (22000): ERROR 201 (22000): Illegal
> data. Expected length of at least 8 bytes, but had 4
> TRP,\x0B\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
> x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
> x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
> x00\x00\x00\x00\x00,1499228698442.4f312a19bfdc624776516228c9c4f2d5.
> ### The error may exist in agp/mapper/TripMapper.java (best guess)
> ### The error may involve agp.mapper.TripMapper.selectLast
> ### The error occurred while handling results
> ### SQL: SELECT trid, tid, frtp, frno, gzid, ontm, onty, onlt, onln,
> oftm, ofty, oflt, ofln, onwk, onhr, wday, dist, drtn, delf, sntf,
> cdtm, udtm FROM  trp WHERE  tid = ? AND delf = FALSE ORDER BY oftm
> DESC NULLS LAST FETCH FIRST 1 ROW ONLY
> ### Cause: java.sql.SQLException: ERROR 201 (22000): Illegal data.
> ERROR 201 (22000): ERROR 201 (22000): Illegal data. Expected length of
> at least 8 bytes, but had 4
> TRP,\x0B\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
> x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
> x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
> x00\x00\x00\x00\x00,1499228698442.4f312a19bfdc624776516228c9c4f2d5.
>  at org.apache.ibatis.exceptions.ExceptionFactory.wrapException(
> ExceptionFactory.java:30)
>  at org.apache.ibatis.session.defaults.DefaultSqlSession.
> selectList(DefaultSqlSession.java:150)
>  at org.apache.ibatis.session.defaults.DefaultSqlSession.
> selectList(DefaultSqlSession.java:141)
>  at org.apache.ibatis.session.defaults.DefaultSqlSession.
> selectOne(DefaultSqlSession.java:77)
>  at org.apache.ibatis.binding.MapperMethod.execute(MapperMethod.java:82)
>  at org.apache.ibatis.binding.MapperProxy.invoke(MapperProxy.java:59)
>  at com.sun.proxy.$Proxy38.selectLast(Unknown Source)
>  at agp.dao.TripDAO.selectLast(TripDAO.java:55)
>  at agp.trip.func.trip.SelectTripCondition.fixAccOnLocation(
> SelectTripCondition.java:128)
>  at agp.trip.func.trip.SelectTripCondition.call(
> SelectTripCondition.java:65)
>  at agp.trip.func.trip.SelectTripCondition.call(
> SelectTripCondition.java:25)
>  at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.
> apply(JavaRDDLike.scala:125)
>  at 

Re: Getting Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: org.apache.phoenix.spark. Please find packages at http://spark-packages.org Exception

2017-03-16 Thread Josh Mahonin
Hi Sateesh,

It seems you are missing the import which gives Spark visibility into the
"org.apache.phoenix.spark". From the documentation page:

*import org.apache.phoenix.spark._*

I'm not entirely sure how this works in Java, however. You might have some
luck with:

*import static org.apache.phoenix.spark.*;*

If you do get this working, please update the list. It would be nice to add
this to the existing documentation.

Josh



On Wed, Mar 15, 2017 at 2:06 PM, Sateesh Karuturi <
sateesh.karutu...@gmail.com> wrote:

> Hello folks..,
>
> I am trying to execute sample spark-phoenix application.
> but i am getting
>  Exception in thread "main" java.lang.ClassNotFoundException: Failed to
> find data source: org.apache.phoenix.spark. Please find packages at
> http://spark-packages.org exception.
>
> here is my code:
>
> package com.inndata.spark.sparkphoenix;
>
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.apache.spark.sql.DataFrame;
> import org.apache.spark.sql.SQLContext;
>
> import java.io.Serializable;
>
> /**
>  *
>  */
> public class SparkConnection implements Serializable {
>
> public static void main(String args[]) {
> SparkConf sparkConf = new SparkConf();
> sparkConf.setAppName("spark-phoenix-df");
> sparkConf.setMaster("local[*]");
> JavaSparkContext sc = new JavaSparkContext(sparkConf);
> SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
>
> DataFrame df = sqlContext.read()
> .format("org.apache.phoenix.spark")
> .option("table", "ORDERS")
> .option("zkUrl", "localhost:2181")
> .load();
> df.count();
>
> }
> }
>
> and here is my pom.xml:
>
> <dependencies>
>   <dependency>
>     <groupId>org.apache.phoenix</groupId>
>     <artifactId>phoenix-core</artifactId>
>     <version>4.8.0-HBase-1.2</version>
>   </dependency>
>
>   <dependency>
>     <groupId>org.scala-lang</groupId>
>     <artifactId>scala-library</artifactId>
>     <version>2.10.6</version>
>     <scope>provided</scope>
>   </dependency>
>
>   <dependency>
>     <groupId>org.apache.phoenix</groupId>
>     <artifactId>phoenix-spark</artifactId>
>     <version>4.8.0-HBase-1.2</version>
>   </dependency>
>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-core_2.10</artifactId>
>     <version>1.6.2</version>
>   </dependency>
>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-sql_2.10</artifactId>
>     <version>1.6.2</version>
>   </dependency>
>
>   <dependency>
>     <groupId>org.apache.hadoop</groupId>
>     <artifactId>hadoop-client</artifactId>
>     <version>2.7.3</version>
>   </dependency>
>
>   <dependency>
>     <groupId>org.apache.hadoop</groupId>
>     <artifactId>hadoop-common</artifactId>
>     <version>2.7.3</version>
>   </dependency>
>
>   <dependency>
>     <groupId>org.apache.hadoop</groupId>
>     <artifactId>hadoop-common</artifactId>
>     <version>2.7.3</version>
>   </dependency>
>
>   <dependency>
>     <groupId>org.apache.hadoop</groupId>
>     <artifactId>hadoop-hdfs</artifactId>
>     <version>2.7.3</version>
>   </dependency>
>
>   <dependency>
>     <groupId>org.apache.hbase</groupId>
>     <artifactId>hbase-client</artifactId>
>     <version>1.2.4</version>
>   </dependency>
>
>   <dependency>
>     <groupId>org.apache.hbase</groupId>
>     <artifactId>hbase-hadoop-compat</artifactId>
>     <version>1.2.4</version>
>   </dependency>
>
>   <dependency>
>     <groupId>org.apache.hbase</groupId>
>     <artifactId>hbase-hadoop2-compat</artifactId>
>     <version>1.2.4</version>
>   </dependency>
>
>   <dependency>
>     <groupId>org.apache.hbase</groupId>
>     <artifactId>hbase-server</artifactId>
>     <version>1.2.4</version>
>   </dependency>
>
>   <dependency>
>     <groupId>org.apache.hbase</groupId>
>     <artifactId>hbase-it</artifactId>
>     <version>1.2.4</version>
>     <type>test-jar</type>
>   </dependency>
>
>   <dependency>
>     <groupId>junit</groupId>
>     <artifactId>junit</artifactId>
>     <version>3.8.1</version>
>     <scope>test</scope>
>   </dependency>
> </dependencies>
>
>
> here is the stackoverflow link:
>
>
> http://stackoverflow.com/questions/42816998/getting-failed-to-find-data-source-org-apache-phoenix-spark-please-find-packag
>
> please help me out.
>
>


Re: PySpark and Phoenix Dynamic Columns

2017-02-24 Thread Josh Mahonin
Hi Craig,

I think this is an open issue in PHOENIX-2648 (
https://issues.apache.org/jira/browse/PHOENIX-2648)

There seems to be a workaround by using a 'VIEW' instead, as mentioned in
that ticket.
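
One possible shape of that workaround (untested, and the exact syntax may
differ; see the ticket) is to declare the dynamic column on a view and then
point the Spark reader at the view instead of the base table:

import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:phoenix:zkhost:2181")
conn.createStatement().execute(
  """CREATE VIEW IF NOT EXISTS MYTABLE_VIEW ("dynamic_column" VARCHAR) AS SELECT * FROM MYTABLE""")
conn.close()

// then load it via .option("table", "MYTABLE_VIEW") in the DataFrame reader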

Good luck,

Josh

On Thu, Feb 23, 2017 at 11:56 PM, Craig Roberts 
wrote:

> Hi all,
>
> I've got a (very) basic Spark application in Python that selects some
> basic information from my Phoenix table. I can't quite figure out how (or
> even if I can) select dynamic columns through this, however.
>
> Here's what I have;
>
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SQLContext
>
> conf = SparkConf().setAppName("pysparkPhoenixLoad").setMaster("local")
> sc = SparkContext(conf=conf)
> sqlContext = SQLContext(sc)
>
> df = sqlContext.read.format("org.apache.phoenix.spark") \
>.option("table", """MYTABLE("dyamic_column" VARCHAR)""") \
>.option("zkUrl", "127.0.0.1:2181:/hbase-unsecure") \
>.load()
>
> df.show()
> df.printSchema()
>
>
> I get a "org.apache.phoenix.schema.TableNotFoundException:" error for the
> above.
>
> If I try and load the data frame as a table and query that with SQL:
>
> sqlContext.registerDataFrameAsTable(df, "test")
> sqlContext.sql("""SELECT * FROM test("dynamic_column" VARCHAR)""")
>
>
> I get a bit of a strange exception:
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o37.sql.
> : java.lang.RuntimeException: [1.19] failure: ``union'' expected but `('
> found
>
> SELECT * FROM test("dynamic_column" VARCHAR)
>
>
>
> Does anybody have a pointer on whether this is supported and how I might
> be able to query a dynamic column? I haven't found much information on the
> wider Internet about Spark + Phoenix integration for this kind of
> thing...Simple selects are working. Final note: I have (rather stupidly)
> lower-cased my column names in Phoenix, so I need to quote them when I
> execute a query (I'll be changing this as soon as possible).
>
> Any assistance would be appreciated :)
> *-- Craig*
>


Re: Still having issues

2017-02-16 Thread Josh Mahonin
It still seems that Spark is unable to find all of the Phoenix/HBase
classes that are necessary.

As a reference, I've got a Docker image that might help:

https://github.com/jmahonin/docker-phoenix/tree/phoenix_spark

The versions of Phoenix and Spark it uses are a bit out of date, but it
shows the necessary classes and settings to make Spark happy.

Good luck!

Josh

On Thu, Feb 16, 2017 at 4:10 AM, Dequn Zhang 
wrote:

> Please check whether your table was created by Phoenix (meaning this table
> is not a *mapping* of an existing HBase table). You can follow the sample on
> the Phoenix official site; you only need to *change the version to the
> latest*, use the *phoenix-client* JAR instead, and make sure the *schemas
> correspond*. Create a new table to test, use simple data, and *keep it
> independent of your business data*.
>
> We don’t know your background and code, so that’s all I can help with.
>
>
> On 16 February 2017 at 16:57:26, Nimrod Oren (nimrod.oren@veracity-group.
> com) wrote:
>
> org.apache.hadoop.hbase.HTableDescriptor.setValue
>
>


Re: FW: Failing on writing Dataframe to Phoenix

2017-02-15 Thread Josh Mahonin
Hi,

Spark is unable to load the Phoenix classes it needs. If you're using a
recent version of Phoenix, please ensure the "fat" *client* JAR (or for
older versions of Phoenix, the Phoenix *client*-spark JAR) is on your Spark
driver and executor classpath [1]. The 'phoenix-spark' JAR is insufficient
to provide Spark all of the classes necessary.

[1] https://phoenix.apache.org/phoenix_spark.html
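
Roughly, that means something like the following in spark-defaults.conf
(the JAR path is a placeholder; check [1] for the exact instructions for
your version):

spark.driver.extraClassPath=/path/to/phoenix-<version>-client.jar
spark.executor.extraClassPath=/path/to/phoenix-<version>-client.jar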

On Wed, Feb 15, 2017 at 10:29 AM, Nimrod Oren <
nimrod.o...@veracity-group.com> wrote:

> Hi,
>
>
>
> I'm trying to write a simple dataframe to Phoenix:
>
>  df.save("org.apache.phoenix.spark", SaveMode.Overwrite,
>
>   Map("table" -> "TEST_SAVE", "zkUrl" -> "zk.internal:2181"))
>
>
>
> I have the following in my pom.xml:
>
> <dependency>
>   <groupId>org.apache.phoenix</groupId>
>   <artifactId>phoenix-spark</artifactId>
>   <version>${phoenix-version}</version>
>   <scope>provided</scope>
> </dependency>
>
>
>
> and phoenix-spark is in spark-defaults.conf on all servers. However I'm
> getting the following error:
>
>
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/phoenix/util/SchemaUtil
>
> at org.apache.phoenix.spark.DataFrameFunctions$$anonfun$1.
> apply(DataFrameFunctions.scala:33)
>
> at org.apache.phoenix.spark.DataFrameFunctions$$anonfun$1.
> apply(DataFrameFunctions.scala:33)
>
> at scala.collection.TraversableLike$$anonfun$map$
> 1.apply(TraversableLike.scala:244)
>
> at scala.collection.TraversableLike$$anonfun$map$
> 1.apply(TraversableLike.scala:244)
>
> at scala.collection.IndexedSeqOptimized$class.
> foreach(IndexedSeqOptimized.scala:33)
>
> at scala.collection.mutable.ArrayOps$ofRef.foreach(
> ArrayOps.scala:108)
>
> at scala.collection.TraversableLike$class.map(
> TraversableLike.scala:244)
>
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>
> at org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(
> DataFrameFunctions.scala:33)
>
> at org.apache.phoenix.spark.DefaultSource.createRelation(
> DefaultSource.scala:47)
>
> at org.apache.spark.sql.execution.datasources.
> ResolvedDataSource$.apply(ResolvedDataSource.scala:222)
>
> at org.apache.spark.sql.DataFrameWriter.save(
> DataFrameWriter.scala:148)
>
> at org.apache.spark.sql.DataFrame.save(DataFrame.scala:2045)
>
> at com.pelephone.TrueCallLoader$.main(TrueCallLoader.scala:184)
>
> at com.pelephone.TrueCallLoader.main(TrueCallLoader.scala)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:57)
>
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:606)
>
> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$
> deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(
> SparkSubmit.scala:181)
>
> at org.apache.spark.deploy.SparkSubmit$.submit(
> SparkSubmit.scala:206)
>
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.
> scala:121)
>
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> Caused by: java.lang.ClassNotFoundException: org.apache.phoenix.util.
> SchemaUtil
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>
>
>
> Am I missing something?
>
>
>
> Nimrod
>
>
>
>
>
>
>


Re: Phoenix-Spark Integration Java Code Throwing org.apache.hadoop.mapred.InvalidJobConfException Exception

2017-02-01 Thread Josh Mahonin
Hi Ravi,

It looks like you're invoking the PhoenixInputFormat class directly from
Spark, which actually bypasses the phoenix-spark integration completely.

Others on the list might be more helpful with regards to Java
implementation, but I suspect if you start with using the DataFrame API,
following something similar to the PySpark example in the documentation
[1], you'll be able to load your data. If and when you get something
working, please reply to the list or submit a patch with the Java code that
worked for you, we can update the documentation as well.

Thanks!

Josh

[1] https://phoenix.apache.org/phoenix_spark.html



On Wed, Feb 1, 2017 at 3:02 AM, Ravi Kumar Bommada 
wrote:

> Hi,
>
> I’m trying to write a phoenix-spark sample job in Java to read a few columns
> from HBase and write them back to HBase after some manipulation. While
> running this job I’m getting an exception saying
> “org.apache.hadoop.mapred.InvalidJobConfException:
> Output directory not set”, though I had set the output format to
> PhoenixOutputFormat. Please find the code and exception attached. The
> command to submit the job is mentioned below; any leads would be
> appreciated.
>
> Spark job submit command:
>
> spark-submit --class bulk_test.PhoenixSparkJob --driver-class-path
> /home/cloudera/Desktop/phoenix-client-4.5.2-1.clabs_phoenix1.2.0.p0.774.jar
> --master local myjar.jar
>
>
>
>
>
> Regard’s
>
> Ravi Kumar B
> Mob: +91 9591144511 <+91%2095911%2044511>
>
>
>
>


Re: Moving column family into new table

2017-01-19 Thread Josh Mahonin
It's a bit peculiar that you've got it pre-split to 10 salt buckets, but
seeing 400+ partitions. It sounds like HBase is splitting the regions on
you, possibly due to the 'hbase.hregion.max.filesize' setting. You should
be able to check the HBase Master UI and see the table details to see how
many regions there are, and what nodes they're located on. Right now, the
Phoenix MR / Spark integration basically assigns one partition per region.

As a total guess, I wonder if somehow the first 380 partitions are
relatively sparse, and the bulk of the data is in the remaining 70
partitions. You might be able to diagnose that by adding some logging in a
'mapPartitions()' call. It's possible that running a major compaction on
that table might help redistribute the data as well.
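
For example, a quick sketch of that kind of check, with df being the
DataFrame loaded from Phoenix:

// Log the row count of every partition to spot skew
df.rdd
  .mapPartitionsWithIndex((idx, rows) => Iterator((idx, rows.size)))
  .collect()
  .sortBy(_._1)
  .foreach { case (idx, n) => println(s"partition $idx: $n rows") }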

If you're seeing your task getting killed, definitely try dig into the
Spark executor / driver logs to try find a root cause. If you're using
YARN, you can usually get into the Spark history server, then check the
'stdout' / 'stderr' logs for each executor.

Re: architecture recommendations, it's possible that phoenix-spark isn't
the right tool for this job, though we routinely read / write billions of
rows with it. I'd recommend trying to start with a smaller subset of your
data and make sure you've got the schema, queries and HBase settings setup
the way you like, then add Spark into the mix. Then start adding a bit more
data, check results, find any bottlenecks, and tune as needed.

If you're able to identify any issues specifically with Phoenix, bug
reports and patches are greatly appreciated!

Best of luck,

Josh


On Thu, Jan 19, 2017 at 12:30 PM, Mark Heppner <heppner.m...@gmail.com>
wrote:

> Thanks for the quick reply, Josh!
>
> For our demo cluster, we have 5 nodes, so the table was already set to 10
> salt buckets. I know you can increase the salt buckets after the table is
> created, but how do you change the split points? The repartition in Spark
> seemed to be extremely inefficient, so we were trying to skip it and keep
> the 400+ default partitions.
>
> The biggest issue we're facing is that as Spark goes through the
> partitions during the scan, it becomes exponentially slower towards the
> end. Around task 380/450, it slows down to a halt, eventually timing out
> around 410 and getting killed. We have no idea if this is something with
> Spark, YARN, or HBase, so that's why we were brainstorming with using the
> foreign key-based layout, hoping that the files on HDFS would be more
> compacted.
>
> We haven't noticed too much network overhead, nor have we seen CPU or RAM
> usage too high. Our nodes are pretty big, 32 cores and 256 GB RAM each,
> connected on a 10 GbE network. Even if our query is for 80-100 rows, the
> Spark job still slows to a crawl at the end, but that should really only be
> about 80 MB of data it would be pulling out of Phoenix into the executors.
> I guess we should have verified that the Phoenix+Spark plugin did achieve
> data locality, but there isn't anything that says otherwise. Even though it
> doesn't have data locality, we have no idea why it would progressively slow
> down as it reaches the end of the scan/filter.
>
> The images are converted to a NumPy array, then saved as a binary string
> into Phoenix. In Spark, this is fairly quick to convert the binary string
> back to the NumPy array. This also allows us to use GET_BYTE() from Phoenix
> to extract specific values within the array, without going through Spark at
> all. Do you have any other architecture recommendations for our use case?
> Would storing the images directly in HBase be any better?
>
> On Thu, Jan 19, 2017 at 12:02 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
>
>> Hi Mark,
>>
>> At present, the Spark partitions are basically equivalent to the number
>> of regions in the underlying HBase table. This is typically something you
>> can control yourself, either using pre-splitting or salting (
>> https://phoenix.apache.org/faq.html#Are_there_any_tips_for_
>> optimizing_Phoenix). Given that you have 450+ partitions though, it
>> sounds like you should be able to achieve a decent level of parallelism,
>> but that's a knob you can fiddle with. It might also be useful to look at
>> Spark's "repartition" operation if you have idle Spark executors.
>>
>> The partitioning is sort of orthogonal from the primary key layout and
>> the resulting query efficiency, but the strategy you've taken with your
>> schema seems fairly sensible to me. Given that your primary key is the 'id'
>> field, the query you're using is going to be much more efficient than,
>> e.g., filtering on the 'title' column. Iterating on your schema and queries
>> using straight SQL and then applying that to Spark after is probably a good
>

Re: Moving column family into new table

2017-01-19 Thread Josh Mahonin
Hi Mark,

At present, the Spark partitions are basically equivalent to the number of
regions in the underlying HBase table. This is typically something you can
control yourself, either using pre-splitting or salting (
https://phoenix.apache.org/faq.html#Are_there_any_tips_for_optimizing_Phoenix).
Given that you have 450+ partitions though, it sounds like you should be
able to achieve a decent level of parallelism, but that's a knob you can
fiddle with. It might also be useful to look at Spark's "repartition"
operation if you have idle Spark executors.

The partitioning is sort of orthogonal from the primary key layout and the
resulting query efficiency, but the strategy you've taken with your schema
seems fairly sensible to me. Given that your primary key is the 'id' field,
the query you're using is going to be much more efficient than, e.g.,
filtering on the 'title' column. Iterating on your schema and queries using
straight SQL and then applying that to Spark after is probably a good
strategy here to get more familiar with query performance.

If you're reading the binary 'data' column in Spark and seeing a lot of
network overhead, one thing to be aware of is the present Phoenix MR /
Spark code isn't location aware, so executors are likely reading big chunks
of data from another node. There's a few patches in to address this, but
they're not in a released version yet:

https://issues.apache.org/jira/browse/PHOENIX-3600
https://issues.apache.org/jira/browse/PHOENIX-3601

Good luck!

Josh




On Thu, Jan 19, 2017 at 11:30 AM, Mark Heppner 
wrote:

> Our use case is to analyze images using Spark. The images are typically
> ~1MB each, so in order to prevent the small files problem in HDFS, we went
> with HBase and Phoenix. For 20+ million images and metadata, this has been
> working pretty well so far. Since this is pretty new to us, we didn't
> create a robust design:
>
> CREATE TABLE IF NOT EXISTS mytable
> (
> id VARCHAR(36) NOT NULL PRIMARY KEY,
> title VARCHAR,
> ...
> image.dtype VARCHAR(12),
> image.width UNSIGNED_INT,
> image.height UNSIGNED_INT,
> image.data VARBINARY
> )
>
> Most queries are on the metadata, so all of that is kept in the default
> column family. Only the image data is stored in a secondary column family.
> Additional indexes are created anyways, so the main table isn't usually
> touched.
>
> We first run a Phoenix query to check if there are any matches. If so,
> then we start a Spark job on the images. The primary keys are sent to the
> PySpark job, which then grabs the images based on the primary keys:
>
> df = sqlContext.read \
> .format('org.apache.phoenix.spark') \
> .option('table', 'mytable') \
> .option('zkUrl', 'localhost:2181:/hbase-unsecure') \
> .load()
> df.registerTempTable('mytable')
>
> query = 'SELECT IMAGE FROM mytable WHERE ID = 1 OR ID = 2 ...'
> df_imgs = sqlContext.sql(query)
>
> When this was first designed, we thought since the lookup was by primary
> key, it would be smart enough to do a skip scan, but it appears to be doing
> a full scan. The df_imgs.rdd.getNumPartitions() ends up being 450+, which
> matches up with the number of split files in HDFS.
>
> Would it be better to use a foreign key and split the tables :
>
> CREATE TABLE IF NOT EXISTS mytable
> (
> id VARCHAR(36) NOT NULL PRIMARY KEY,
> title VARCHAR,
> image_id VARCHAR(36)
> )
> CREATE TABLE IF NOT EXISTS images
> (
> image_id VARCHAR(36) NOT NULL PRIMARY KEY,
> dtype VARCHAR(12),
> width UNSIGNED_INT,
> height UNSIGNED_INT,
> data VARBINARY
> )
>
> If the first query grabs the image_ids and send them to Spark, would Spark
> be able to handle the query more efficiently?
>
> If this is a better design, is there any way of moving the "image" column
> family from "mytable" to the default column family of the new "images"
> table? Is it possible to create the new table with the "image_id"s, make
> the foreign keys, then move the column family into the new table?
>
>
> --
> Mark Heppner
>


Re: how to write spark2 dataframe to phoenix?

2017-01-11 Thread Josh Mahonin
Hi,

Spark 2.x isn't currently supported in a released Phoenix version, but is
slated for the upcoming 4.10.0 release.

If you'd like to compile your own version in the meantime, you can find the
ticket/patch here:
https://issues.apache.org/jira/browse/PHOENIX-

or:
https://github.com/apache/phoenix/commit/a0e5efcec5a1a732b2dce9794251242c3d66eea6

Josh

On Tue, Jan 10, 2017 at 10:27 PM, lk_phoenix  wrote:

> hi,all:
> I try to write a dataframe to phoenix with spark2.1:
> 
> df.write.format("org.apache.phoenix.spark").mode(SaveMode.Overwrite).options(Map("table"
> -> "biz","zkUrl" -> "slave1:2181,slave2:2181")).save()
> but it's not work,I got error:
> java.lang.AbstractMethodError: org.apache.phoenix.spark.
> DefaultSource.createRelation(Lorg/apache/spark/sql/
> SQLContext;Lorg/apache/spark/sql/SaveMode;Lscala/
> collection/immutable/Map;Lorg/apache/spark/sql/Dataset;)
> Lorg/apache/spark/sql/sources/BaseRelation;
>   at org.apache.spark.sql.execution.datasources.
> DataSource.write(DataSource.scala:426)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
>   ... 60 elided
>
> 2017-01-11
> --
> lk_phoenix
>


Re: Spark hang on load phoenix table

2016-11-16 Thread Josh Mahonin
Hi,

Are there any logs in the Spark driver and executors which would help
provide some context? When diagnosing, increasing the log level to DEBUG
might be useful as well.

Also, the snippet you posted is a 'lazy' operation. In theory it should
return quickly, and only evaluate when some sort of Spark action is
performed on it (e.g., count, distinct, save, etc.). If the operation ends
up hanging, perhaps there's some sort of connectivity issue, or maybe the
Zookeeper Znode Parent is missing from the full URL?
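
For example, a sketch that includes the znode parent (if yours is
non-default) and an action that actually forces evaluation:

val df = sqlContext.load(
  "org.apache.phoenix.spark",
  Map("table" -> "\"test_table\"", "zkUrl" -> "10.17.10.2:2181:/hbase-unsecure")
)
println(df.count())  // load() is lazy; count() triggers the actual scan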

Best,

Josh



On Tue, Nov 15, 2016 at 9:18 PM, 马骉  wrote:

> Hi all,
> Have you met the problem where, after submitting the job to Spark, it
> just hangs? No stage messages are shown on the 8080 web UI.
> It seems something is wrong in
> val df = sqlContext.load(
>
>   "org.apache.phoenix.spark",
>   Map("table" -> "\"test_table\"", "zkUrl" -> "10.17.10.2:2181")
> )
>
>Any ideas, please
> --
> Warmest Regards~
> From BiaoMa
>
>


Re: Inserting into Temporary Phoenix table using Spark plugin

2016-11-16 Thread Josh Mahonin
Hi Hussain,

I'm not familiar with the Spark temporary table syntax. Perhaps you can
work around it by using other options, such as the DataFrame.save()
functionality which is documented [1] and unit tested [2].

I suspect what you're encountering is a valid use case. If you could also
file a JIRA ticket, and as a bonus, provide a patch, that would be great.

Best of luck,

Josh

[1] https://phoenix.apache.org/phoenix_spark.html
[2]
https://github.com/apache/phoenix/blob/master/phoenix-spark/src/it/scala/org/apache/phoenix/spark/PhoenixSparkIT.scala#L297
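
For example, writing the DataFrame out directly (rather than through a
Spark temporary table) looks roughly like this with the Spark 1.x writer
API; the target table name and zkUrl are placeholders, and the Phoenix
table must already exist:

import org.apache.spark.sql.SaveMode

df.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)
  .option("table", "EMAIL_ENRON_COPY")
  .option("zkUrl", "zkhost:2181")
  .save()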

On Wed, Nov 16, 2016 at 4:10 AM, Hussain Pirosha <
hussain.piro...@impetus.co.in> wrote:

> I am trying to insert into a temporary table created on a Spark (v 1.6)
> DataFrame loaded using the Phoenix-Spark (v 4.4) plugin. Below is the code:
>
> val sc = new SparkContext("local", "phoenix-test")
>
> val configuration = new Configuration()
>
> configuration.set("zookeeper.znode.parent", "/hbase-unsecure")
>
> val df = sqlContext.phoenixTableAsDataFrame(
>
>   "EMAIL_ENRON", Array("MAIL_FROM", "MAIL_TO"), conf = configuration)
>
> df.registerTempTable("TEMP_TABLE");
>
> Table which is defined in HBase using Phoenix is :
>
> TABLE EMAIL_ENRON(MAIL_FROM BIGINT NOT NULL, MAIL_TO BIGINT NOT NULL
> CONSTRAINT pk PRIMARY KEY(MAIL_FROM, MAIL_TO));
>
> While trying to insert into temporary table, spark-shell gives the below
> error
>
> Insert statement:
>
> sqlContext.sql("insert into table TEMP_TABLE select t.* from (select
> 55,66) t");
>
> Exception:
>
> scala> sqlContext.sql("insert into table TEMP_TABLE select t.* from
> (select 55,66) t");
>
> 16/11/16 11:15:46 INFO ParseDriver: Parsing command: insert into table
> TEMP_TABLE select t.* from (select 55,66) t
>
> 16/11/16 11:15:46 INFO ParseDriver: Parse Completed
>
> org.apache.spark.sql.AnalysisException: unresolved operator
> 'InsertIntoTable LogicalRDD [MAIL_FROM#0L,MAIL_TO#1L], MapPartitionsRDD[3]
> at createDataFrame at PhoenixRDD.scala:117, Map(), false, false;
>
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.
> failAnalysis(CheckAnalysis.scala:38)
>
> at org.apache.spark.sql.catalyst.analysis.Analyzer.
> failAnalysis(Analyzer.scala:44)
>
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$
> anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:203)
>
> What is the correct way to insert into a temporary table? From the
> exception it looks like Phoenix-Spark does allow inserting into a
> temporary table; maybe the syntax that I am using is incorrect.
>
>


Re: Accessing phoenix tables in Spark 2

2016-10-07 Thread Josh Mahonin
Hi Mich,

You're correct that the rowkey is the primary key, but if you're writing to
HBase directly and bypassing Phoenix, you'll have to be careful about the
construction of your row keys to adhere to the Phoenix data types and row
format. I don't think it's very well documented, but you might have some
luck by checking with the data type implementations here:
https://github.com/apache/phoenix/tree/master/phoenix-
core/src/main/java/org/apache/phoenix/schema/types

Another option is to use Phoenix-JDBC from within Spark Streaming. I've got
a toy example of using Spark streaming with Phoenix DataFrames, but it
could just as easily be a batched JDBC upsert.
https://github.com/jmahonin/spark-streaming-phoenix/blob/
master/src/main/scala/SparkStreamingPhoenix.scala
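
A toy sketch of the batched-JDBC-upsert variant from inside a streaming job
(untested; the DStream is assumed to carry (Long, String) pairs, and the
table, columns and zkUrl are placeholders):

import java.sql.DriverManager

stream.foreachRDD { rdd =>
  rdd.foreachPartition { rows =>
    val conn = DriverManager.getConnection("jdbc:phoenix:zkhost:2181")
    conn.setAutoCommit(false)
    val stmt = conn.prepareStatement("UPSERT INTO MY_TABLE (ID, VAL) VALUES (?, ?)")
    rows.foreach { case (id, value) =>
      stmt.setLong(1, id)
      stmt.setString(2, value)
      stmt.executeUpdate()
    }
    conn.commit()  // Phoenix buffers the upserts client-side until commit
    conn.close()
  }
}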

Best of luck,

Josh

On Fri, Oct 7, 2016 at 10:28 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Thank you all. very helpful.
>
> I have not tried the method Ciureanu suggested but will do so.
>
> Now I will be using Spark Streaming to populate the HBase table. I was hoping
> to do this through Phoenix but managed to write a script to write to the HBase
> table from Spark 2 itself.
>
> Having worked with HBase, I take the row key to be the primary key, i.e. unique
> much like in an RDBMS (Oracle). It sounds like Phoenix relies on that when
> creating a table on top of an HBase table. Is this assessment correct, please?
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 7 October 2016 at 14:30, Ciureanu Constantin <
> ciureanu.constan...@gmail.com> wrote:
>
>> In Spark 1.4 it worked via JDBC - sure it would work in 1.6 / 2.0 without
>> issues.
>>
>> Here's a sample code I used (it was getting data in parallel 24
>> partitions)
>>
>>
>> import org.apache.spark.SparkConf
>> import org.apache.spark.SparkContext
>>
>> import org.apache.spark.rdd.JdbcRDD
>> import java.sql.{Connection, DriverManager, ResultSet}
>>
>> sc.addJar("/usr/lib/hbase/hbase-protocol.jar")
>> sc.addJar("phoenix-x.y.z-bin/phoenix-core-x.y.z.jar")
>> sc.addJar("phoenix-x.y.z-bin/phoenix-x.y.z-client.jar")
>>
>> def createConnection() = {
>> Class.forName("org.apache.phoenix.jdbc.PhoenixDriver").newInstance();
>> DriverManager.getConnection("jdbc:phoenix:hd101.lps.stage,hd
>> 102.lps.stage,hd103.lps.stage"); // the Zookeeper quorum
>> }
>>
>> def extractValues(r: ResultSet) = {
>> (r.getLong(1),// datum
>> r.getInt(2),  // pg
>> r.getString(3),  // HID
>> 
>>  )
>> }
>>
>> val data = new JdbcRDD(sc, createConnection,
>> "SELECT DATUM, PG, HID,  ..... WHERE DATUM >= ? * 1000  AND DATUM <= ? *
>> 1000 and PG = ",
>> lowerBound = 1364774400, upperBound = 1384774400, numPartitions = 24,
>> mapRow = extractValues)
>>
>> data.count()
>>
>> println(data.collect().toList)
>>
>>
>> 2016-10-07 15:20 GMT+02:00 Ted Yu <yuzhih...@gmail.com>:
>>
>>> JIRA on hbase side:
>>> HBASE-16179
>>>
>>> FYI
>>>
>>> On Fri, Oct 7, 2016 at 6:07 AM, Josh Mahonin <jmaho...@gmail.com> wrote:
>>>
>>>> Hi Mich,
>>>>
>>>> There's an open ticket about this issue here:
>>>> https://issues.apache.org/jira/browse/PHOENIX-
>>>>
>>>> Long story short, Spark changed their API (again), breaking the
>>>> existing integration. I'm not sure the level of effort to get it working
>>>> with Spark 2.0, but based on examples from other projects, it looks like
>>>> there's a fair bit of Maven module work to support both Spark 1.x and Spark
>>>> 2.x concurrently in the same project. Patches are very welcome!
>>>>
>>>> Best,
>>>>
>>>> Josh
>>>>
>>>>
>>>>
>>>> On Fri, Oct 7, 2016 at 8:33 AM, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Has anyone managed to read phoenix table in Spark 2 by any chance
>>>>> please?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn * 
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Accessing phoenix tables in Spark 2

2016-10-07 Thread Josh Mahonin
Hi Mich,

There's an open ticket about this issue here:
https://issues.apache.org/jira/browse/PHOENIX-

Long story short, Spark changed their API (again), breaking the existing
integration. I'm not sure the level of effort to get it working with Spark
2.0, but based on examples from other projects, it looks like there's a
fair bit of Maven module work to support both Spark 1.x and Spark 2.x
concurrently in the same project. Patches are very welcome!

Best,

Josh



On Fri, Oct 7, 2016 at 8:33 AM, Mich Talebzadeh 
wrote:

> Hi,
>
> Has anyone managed to read phoenix table in Spark 2 by any chance please?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: bulk-delete spark phoenix

2016-09-28 Thread Josh Mahonin
Hi Fabio,

You could probably just execute a regular DELETE query from a JDBC call,
which is generally safe to do either from the Spark driver or within an
executor. As long as auto-commit is enabled, it's an entirely server side
operation: https://phoenix.apache.org/language/#delete
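
For example, a minimal sketch (the table name and predicate are
placeholders):

import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:phoenix:zkhost:2181")
conn.setAutoCommit(true)  // with auto-commit on, the DELETE executes server-side
val deleted = conn.createStatement()
  .executeUpdate("DELETE FROM MY_TABLE WHERE DELETED_FLAG = TRUE")
println(s"rows deleted: $deleted")
conn.close()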

Josh

On Wed, Sep 28, 2016 at 2:13 PM, fabio ferrante 
wrote:

> Hi,
>
> I would like to perform a bulk delete on HBase using Apache Phoenix from
> Spark. Using the Phoenix-Spark plugin I can successfully perform a bulk load
> using the saveToPhoenix method from PhoenixRDD, but how can I perform a bulk
> delete? There isn't a deleteFromPhoenix method in PhoenixRDD. Is that
> correct? Is implementing such a method a trivial task?
>
> Thanks in advance,
>  Fabio.
>


Re: bulk-upsert spark phoenix

2016-09-28 Thread Josh Mahonin
Hi Antonio,

Certainly, a JIRA ticket with a patch would be fantastic.

Thanks!

Josh

On Wed, Sep 28, 2016 at 12:08 PM, Antonio Murgia <antonio.mur...@eng.it>
wrote:

> Thank you very much for your insights, Josh. If I decide to develop a small
> Phoenix library that does, through Spark, what the CSV loader does, I'll
> surely write to the mailing list, open a JIRA, or maybe even open a PR,
> right?
>
> Thank you again
>
> #A.M.
>
> On 09/28/2016 05:10 PM, Josh Mahonin wrote:
>
> Hi Antonio,
>
> You're correct, the phoenix-spark output uses the Phoenix Hadoop
> OutputFormat under the hood, which effectively does a parallel, batch JDBC
> upsert. It should scale depending on the number of Spark executors,
> RDD/DataFrame parallelism, and number of HBase RegionServers, though
> admittedly there's a lot of overhead involved.
>
> The CSV Bulk loading tool uses MapReduce, it's not integrated with Spark.
> It's likely possible to do so, but it's probably a non-trivial amount of
> work. If you're interested in taking it on, I'd start with looking at the
> following classes:
>
> https://github.com/apache/phoenix/blob/master/phoenix-
> core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java
> https://github.com/apache/phoenix/blob/master/phoenix-
> core/src/main/java/org/apache/phoenix/mapreduce/AbstractBulkLoadTool.java
> https://github.com/apache/phoenix/blob/master/phoenix-
> core/src/main/java/org/apache/phoenix/mapreduce/PhoenixOutputFormat.java
> https://github.com/apache/phoenix/blob/master/phoenix-
> core/src/main/java/org/apache/phoenix/mapreduce/PhoenixRecordWriter.java
> https://github.com/apache/phoenix/blob/master/phoenix-
> spark/src/main/scala/org/apache/phoenix/spark/DataFrameFunctions.scala
>
> Good luck,
>
> Josh
>
> On Tue, Sep 27, 2016 at 10:43 AM, Antonio Murgia <antonio.mur...@eng.it>
> wrote:
>
>> Hi,
>>
>> I would like to perform a Bulk insert to HBase using Apache Phoenix from
>> Spark. I tried using Apache Spark Phoenix library but, as far as I was
>> able to understand from the code, it looks like it performs a jdbc batch
>> of upserts (am I right?). Instead I want to perform a Bulk load like the
>> one described in this blog post
>> (https://zeyuanxy.github.io/HBase-Bulk-Loading/) but taking advance of
>> the automatic transformation between java/scala types to Bytes.
>>
>> I'm actually using phoenix 4.5.2, therefore I cannot use hive to
>> manipulate the phoenix table, and if it possible i want to avoid to
>> spawn a MR job that reads data from csv
>> (https://phoenix.apache.org/bulk_dataload.html). Actually i just want to
>> do what the csv loader is doing with MR but programmatically with Spark
>> (since the data I want to persist is already loaded in memory).
>>
>> Thank you all!
>>
>>
>
>


Re: bulk-upsert spark phoenix

2016-09-28 Thread Josh Mahonin
Hi Antonio,

You're correct, the phoenix-spark output uses the Phoenix Hadoop
OutputFormat under the hood, which effectively does a parallel, batch JDBC
upsert. It should scale depending on the number of Spark executors,
RDD/DataFrame parallelism, and number of HBase RegionServers, though
admittedly there's a lot of overhead involved.
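
As a rough illustration of that write path (the table name and quorum are
invented), the write's parallelism roughly follows the DataFrame's partition
count:

// 'df' is an existing DataFrame whose columns match the Phoenix table.
df.repartition(32) // e.g. roughly match the number of executors / region servers
  .write
  .format("org.apache.phoenix.spark")
  .mode("overwrite") // phoenix-spark requires SaveMode.Overwrite
  .option("table", "OUTPUT_TABLE")
  .option("zkUrl", "zk-host:2181")
  .save()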

The CSV Bulk loading tool uses MapReduce, it's not integrated with Spark.
It's likely possible to do so, but it's probably a non-trivial amount of
work. If you're interested in taking it on, I'd start with looking at the
following classes:

https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/AbstractBulkLoadTool.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixOutputFormat.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixRecordWriter.java
https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DataFrameFunctions.scala

Good luck,

Josh

On Tue, Sep 27, 2016 at 10:43 AM, Antonio Murgia 
wrote:

> Hi,
>
> I would like to perform a Bulk insert to HBase using Apache Phoenix from
> Spark. I tried using Apache Spark Phoenix library but, as far as I was
> able to understand from the code, it looks like it performs a jdbc batch
> of upserts (am I right?). Instead I want to perform a Bulk load like the
> one described in this blog post
> (https://zeyuanxy.github.io/HBase-Bulk-Loading/) but taking advance of
> the automatic transformation between java/scala types to Bytes.
>
> I'm actually using phoenix 4.5.2, therefore I cannot use hive to
> manipulate the phoenix table, and if it possible i want to avoid to
> spawn a MR job that reads data from csv
> (https://phoenix.apache.org/bulk_dataload.html). Actually i just want to
> do what the csv loader is doing with MR but programmatically with Spark
> (since the data I want to persist is already loaded in memory).
>
> Thank you all!
>
>


Re: phoenix spark Plugin not working for spark 2.0

2016-09-26 Thread Josh Mahonin
Hi Dalin,

It looks like Spark may have gone and broken their API again for Spark 2.0.
Could you file a JIRA ticket please?

Thanks,

Josh

On Mon, Sep 26, 2016 at 1:17 PM, dalin.qin  wrote:

> Hi I'm trying some test with spark 2.0 together with phoenix 4.8 . My
> enviroment is HDP 2.5 , I installed phoenix 4.8 by myself.
>
> I got everything working perfectly under spark 1.6.2
>
> >>> df = sqlContext.read \
> ...   .format("org.apache.phoenix.spark") \
> ...   .option("table", "TABLE1") \
> ...   .option("zkUrl", "namenode:2181:/hbase-unsecure") \
> ...   .load()
>
> >>> df.show()
>
> +---+--+
> | ID|  COL1|
> +---+--+
> |  1|test_row_1|
> |  2|test_row_2|
> +---+--+
>
>
> But I got  error "org.apache.spark.sql.DataFrame not existing "  when
> loading data from phoenix table by using spark 2.0 (I've made sure that
> necessary jar files are in spark classpath) . I checked there is no such
> class in phoenix 4.8 . Can somebody check and update
> https://phoenix.apache.org/phoenix_spark.html for spark 2.0 usage?
>
>
> In [1]: df = sqlContext.read \
>...:   .format("org.apache.phoenix.spark") \
>...:   .option("table", "TABLE1") \
>...:   .option("zkUrl", "namenode:2181:/hbase-unsecure") \
>...:   .load()
> 
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 df = sqlContext.read   .format("org.apache.phoenix.spark")
> .option("table", "TABLE1")   .option("zkUrl", "namenode:2181:/hbase-unsecure")
>   .load()
>
> /usr/hdp/2.5.0.0-1245/spark2/python/pyspark/sql/readwriter.pyc in
> load(self, path, format, schema, **options)
> 151 return self._df(self._jreader.load(
> self._spark._sc._jvm.PythonUtils.toSeq(path)))
> 152 else:
> --> 153 return self._df(self._jreader.load())
> 154
> 155 @since(1.4)
>
> /usr/hdp/2.5.0.0-1245/spark2/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py
> in __call__(self, *args)
> 931 answer = self.gateway_client.send_command(command)
> 932 return_value = get_return_value(
> --> 933 answer, self.gateway_client, self.target_id, self.name
> )
> 934
> 935 for temp_arg in temp_args:
>
> /usr/hdp/2.5.0.0-1245/spark2/python/pyspark/sql/utils.pyc in deco(*a,
> **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
>
> /usr/hdp/2.5.0.0-1245/spark2/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py
> in get_return_value(answer, gateway_client, target_id, name)
> 310 raise Py4JJavaError(
> 311 "An error occurred while calling {0}{1}{2}.\n".
> --> 312 format(target_id, ".", name), value)
> 313 else:
> 314 raise Py4JError(
>
> Py4JJavaError: An error occurred while calling o43.load.
> : java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
> at java.lang.Class.getDeclaredMethods0(Native Method)
> at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
> at java.lang.Class.getDeclaredMethod(Class.java:2128)
> at java.io.ObjectStreamClass.getPrivateMethod(
> ObjectStreamClass.java:1475)
> at java.io.ObjectStreamClass.access$1700(ObjectStreamClass.
> java:72)
> at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:498)
> at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:472)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.io.ObjectStreamClass.(ObjectStreamClass.java:472)
> at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:369)
> at java.io.ObjectOutputStream.writeObject0(
> ObjectOutputStream.java:1134)
> at java.io.ObjectOutputStream.defaultWriteFields(
> ObjectOutputStream.java:1548)
> at java.io.ObjectOutputStream.writeSerialData(
> ObjectOutputStream.java:1509)
> at java.io.ObjectOutputStream.writeOrdinaryObject(
> ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(
> ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeObject(
> ObjectOutputStream.java:348)
> at org.apache.spark.serializer.JavaSerializationStream.
> writeObject(JavaSerializer.scala:43)
> at org.apache.spark.serializer.JavaSerializerInstance.
> serialize(JavaSerializer.scala:100)
> at org.apache.spark.util.ClosureCleaner$.ensureSerializable(
> ClosureCleaner.scala:295)
> at org.apache.spark.util.ClosureCleaner$.org$apache$
> spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
> at org.apache.spark.util.ClosureCleaner$.clean(
> ClosureCleaner.scala:108)
> at 

Re: How to manually generate a salted row key?

2016-09-13 Thread Josh Mahonin
Hi Marica,

Are you able to successfully write your rowkey without salting? If not, it
could be that your 'generateRowKey' function is the culprit.

FWIW, we have some code that does something similar, though we use
'getSaltedKey':

// Uses org.apache.phoenix.util.ByteUtil, org.apache.phoenix.schema.SaltingUtil
// and org.apache.hadoop.hbase.io.ImmutableBytesWritable
// If salting, we need to prepend an empty byte to 'rowKey', then fill it
if (saltBuckets > 0) {
    rowKey = ByteUtil.concat(new byte[1], rowKey);
    rowKey = SaltingUtil.getSaltedKey(new ImmutableBytesWritable(rowKey), saltBuckets);
}

Good luck,

Josh

On Tue, Sep 13, 2016 at 1:59 AM, Marica Tan  wrote:

> Hi,
>
> We have a table created via phoenix with salt bucket, but we're using
> HBase API to insert records since we need to manually set the HBase version
> and I believe that it isn't possible via phoenix.
>
> Our table has a composite key (firstKey varchar, secondKey varchar,
> thirdKey varchar) and when we do a select query with a where condition on
> the firstKey, not all records are retrieved.
>
> We checked the value of the firstKey and it should return 10 records, but
> we're only getting 7.
>
> If we do a where firstKey = 'someValue' we get 7
> If we do a where firstKey like '%someValue' we get 10
>
> So we think the main culprit is the way we generate the row key. Here's
> the code:
>
>  def generateRowKey(nBuckets: Integer, compositeKeys: String*):
> Array[Byte] = {
>
>   val keys = 
> compositeKeys.tail.foldLeft(ArrayBuffer[Array[Byte]](convertToByteArray(compositeKeys.head)))((a,
>  b) => {
> a += QueryConstants.SEPARATOR_BYTE_ARRAY
> a += convertToByteArray(b)
>   })
>
>   val rowKey = ByteUtil.concat(QueryConstants.SEPARATOR_BYTE_ARRAY, 
> keys.toSeq: _*)
>
>   updateSaltingByte(rowKey, nBuckets)
>
>   rowKey
> }
>
> def convertToByteArray(key: String): Array[Byte] = key match {
>   case x if StringUtils.isNotBlank(x) => Bytes.toBytes(x)
>   case _ => ByteUtil.EMPTY_BYTE_ARRAY
> }
>
> def updateSaltingByte(rowKey: Array[Byte], nBuckets: Integer): Unit = {
>   if (nBuckets > 0) {
> rowKey(0) = SaltingUtil.getSaltingByte(rowKey,
>   SaltingUtil.NUM_SALTING_BYTES, rowKey.length - 
> SaltingUtil.NUM_SALTING_BYTES, nBuckets)
>   }
> }
>
>
> Btw, we're using phoenix 4.4
>
>
> Thanks,
> --
> Marica Tan
>


Re: When would/should I use spark with phoenix?

2016-09-13 Thread Josh Mahonin
Hi Dalin,

Thanks for the information, I'm glad to hear that the spark integration is
working well for your use case.

Josh

On Mon, Sep 12, 2016 at 8:15 PM, dalin.qin <dalin...@gmail.com> wrote:

> Hi Josh,
>
> before the project kicked off , we get the idea that hbase is more
> suitable for massive writing rather than batch full table reading(I forgot
> where the idea from ,just some benchmart testing posted in the website
> maybe). So we decide to read hbase only based on primary key for small
> amount of data query request. we store the hbase result in json file either
> as everyday's incremental changes(another benefit from json is you can put
> them in a time based directory so that you could only query part of those
> files), then use spark to read those json files and do the ML model or
> report caculation.
>
> Hope this could help:)
>
> Dalin
>
>
> On Mon, Sep 12, 2016 at 5:36 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
>
>> Hi Dalin,
>>
>> That's great to hear. Have you also tried reading back those rows through
>> Spark for a larger "batch processing" job? Am curious if you have any
>> experiences or insight there from operating on a large dataset.
>>
>> Thanks!
>>
>> Josh
>>
>> On Mon, Sep 12, 2016 at 10:29 AM, dalin.qin <dalin...@gmail.com> wrote:
>>
>>> Hi ,
>>> I've used phoenix table to store billions of rows , rows are
>>> incrementally insert into phoenix by spark every day and the table was for
>>> instant query from web page by providing primary key . so far so good .
>>>
>>> Thanks
>>> Dalin
>>>
>>> On Mon, Sep 12, 2016 at 10:07 AM, Cheyenne Forbes <
>>> cheyenne.osanu.for...@gmail.com> wrote:
>>>
>>>> Thanks everyone, I will be using phoenix for simple input/output and
>>>> the phoenix_spark plugin (https://phoenix.apache.org/phoenix_spark.html)
>>>> for more complex queries, is that the smart thing?
>>>>
>>>> Regards,
>>>>
>>>> Cheyenne Forbes
>>>>
>>>> Chief Executive Officer
>>>> Avapno Omnitech
>>>>
>>>> Chief Operating Officer
>>>> Avapno Solutions, Co.
>>>>
>>>> Chairman
>>>> Avapno Assets, LLC
>>>>
>>>> Bethel Town P.O
>>>> Westmoreland
>>>> Jamaica
>>>>
>>>> Email: cheyenne.osanu.for...@gmail.com
>>>> Mobile: 876-881-7889
>>>> skype: cheyenne.forbes1
>>>>
>>>>
>>>> On Sun, Sep 11, 2016 at 11:07 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> w.r.t. Resource Management, Spark also relies on other framework such
>>>>> as YARN or Mesos.
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Sun, Sep 11, 2016 at 6:31 AM, John Leach <jlea...@gmail.com> wrote:
>>>>>
>>>>>> Spark has a robust execution model with the following features that
>>>>>> are not part of phoenix
>>>>>> * Scalable
>>>>>> * fault tolerance with lineage (Handles large intermediate
>>>>>> results)
>>>>>> * memory management for tasks
>>>>>> * Resource Management (Fair Scheduling)
>>>>>> * Additional SQL Features (Windowing ,etc.)
>>>>>> * Machine Learning Libraries
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> John
>>>>>>
>>>>>> > On Sep 11, 2016, at 2:45 AM, Cheyenne Forbes <
>>>>>> cheyenne.osanu.for...@gmail.com> wrote:
>>>>>> >
>>>>>> > I realized there is a spark plugin for phoenix, any use cases? why
>>>>>> would I use spark with phoenix instead of phoenix by itself?
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: When would/should I use spark with phoenix?

2016-09-12 Thread Josh Mahonin
Hi Dalin,

That's great to hear. Have you also tried reading back those rows through
Spark for a larger "batch processing" job? Am curious if you have any
experiences or insight there from operating on a large dataset.

Thanks!

Josh

On Mon, Sep 12, 2016 at 10:29 AM, dalin.qin  wrote:

> Hi ,
> I've used phoenix table to store billions of rows , rows are incrementally
> insert into phoenix by spark every day and the table was for instant query
> from web page by providing primary key . so far so good .
>
> Thanks
> Dalin
>
> On Mon, Sep 12, 2016 at 10:07 AM, Cheyenne Forbes <
> cheyenne.osanu.for...@gmail.com> wrote:
>
>> Thanks everyone, I will be using phoenix for simple input/output and
>> the phoenix_spark plugin (https://phoenix.apache.org/phoenix_spark.html)
>> for more complex queries, is that the smart thing?
>>
>> Regards,
>>
>> Cheyenne Forbes
>>
>> Chief Executive Officer
>> Avapno Omnitech
>>
>> Chief Operating Officer
>> Avapno Solutions, Co.
>>
>> Chairman
>> Avapno Assets, LLC
>>
>> Bethel Town P.O
>> Westmoreland
>> Jamaica
>>
>> Email: cheyenne.osanu.for...@gmail.com
>> Mobile: 876-881-7889
>> skype: cheyenne.forbes1
>>
>>
>> On Sun, Sep 11, 2016 at 11:07 AM, Ted Yu  wrote:
>>
>>> w.r.t. Resource Management, Spark also relies on other framework such
>>> as YARN or Mesos.
>>>
>>> Cheers
>>>
>>> On Sun, Sep 11, 2016 at 6:31 AM, John Leach  wrote:
>>>
 Spark has a robust execution model with the following features that are
 not part of phoenix
 * Scalable
 * fault tolerance with lineage (Handles large intermediate
 results)
 * memory management for tasks
 * Resource Management (Fair Scheduling)
 * Additional SQL Features (Windowing ,etc.)
 * Machine Learning Libraries


 Regards,
 John

 > On Sep 11, 2016, at 2:45 AM, Cheyenne Forbes <
 cheyenne.osanu.for...@gmail.com> wrote:
 >
 > I realized there is a spark plugin for phoenix, any use cases? why
 would I use spark with phoenix instead of phoenix by itself?


>>>
>>
>


Re: When would/should I use spark with phoenix?

2016-09-11 Thread Josh Mahonin
Just to add to James' comment, they're indeed complementary and it all
comes down to your own use case. Phoenix offers a convenient SQL interface
over HBase, which is capable of doing very fast queries. If you're just
doing insert / retrieval, it's unlikely that Spark will help you much there.

However, if you have requirements to do some of the types of "big data
processing" that Spark excels at, such as graph algorithms or machine
learning, the plugin allows you to access the data in Phoenix+HBase.
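
A minimal sketch of that combination (table, column and quorum names are
invented): load the Phoenix table as a DataFrame, then hand it to Spark's own
processing machinery.

val df = sqlContext.read
  .format("org.apache.phoenix.spark")
  .option("table", "EVENTS")
  .option("zkUrl", "zk-host:2181")
  .load()

// From here it's ordinary Spark: aggregations, MLlib pipelines, GraphX, etc.
df.groupBy("EVENT_TYPE").count().show()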

Good luck,

Josh

On Sun, Sep 11, 2016 at 11:12 AM, James Taylor 
wrote:

> It's not an either/or with Phoenix and Spark - often companies use both as
> they're very complementary. See this [1] blog for an example. Spark is a
> processing engine while Phoenix+HBase is a database/store. You'll need to
> store your data somewhere.
> Thanks,
> James
>
> [1] http://tech.marinsoftware.com/nosql/digital-advertising-
> storage-on-apache-hbase-and-apache-phoenix/?platform=hootsuite
>
>
> On Sunday, September 11, 2016, Cheyenne Forbes <
> cheyenne.osanu.for...@gmail.com> wrote:
>
>> Thank you. For a project as big as Facebook or Snapschat, would you
>> recommend using Spark or Phoenix for things such as message
>> retrieval/insert, user search, user feeds retrieval/insert, etc. and what
>> are the pros and cons?
>>
>> Regard,
>> Cheyenne
>>
>>
>> On Sun, Sep 11, 2016 at 8:31 AM, John Leach  wrote:
>>
>>> Spark has a robust execution model with the following features that are
>>> not part of phoenix
>>> * Scalable
>>> * fault tolerance with lineage (Handles large intermediate
>>> results)
>>> * memory management for tasks
>>> * Resource Management (Fair Scheduling)
>>> * Additional SQL Features (Windowing ,etc.)
>>> * Machine Learning Libraries
>>>
>>>
>>> Regards,
>>> John
>>>
>>> > On Sep 11, 2016, at 2:45 AM, Cheyenne Forbes <
>>> cheyenne.osanu.for...@gmail.com> wrote:
>>> >
>>> > I realized there is a spark plugin for phoenix, any use cases? why
>>> would I use spark with phoenix instead of phoenix by itself?
>>>
>>>
>>


Re: TableNotFoundException, tableName=SYSTEM.CATALOG with phoenix-spark

2016-08-10 Thread Josh Mahonin
Hi Nathan,

That could very well be the issue, I suspect they're running a local fork
if it's on top of HBase 1.2.

I'm not familiar with the EMR environment. When you use sqlline.py, is it
using their own Phoenix JARs or your own? If it's theirs, perhaps the
phoenix-client-spark JAR might be available in the environment as well.
The 'Phoenix Clients' [1] page suggests that there may be a Phoenix
installation at /home/hadoop/usr/lib/phoenix

Good luck,

Josh

[1]
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-phoenix.html#d0e18597

On Wed, Aug 10, 2016 at 9:07 AM, Nathan Davis <nathan.da...@salesforce.com>
wrote:

> Thanks Josh, I tried that out (adding just the phoenix-client-spark jar to
> CP) and got the same error result.
>
> I wonder if the issue is that I'm running on EMR 5 with HBase 1.2. The
> jars I'm using are copied over from the HBase master because there is no
> 4.7.0-HBase-1.2 set in MVN. Is the phoenix-spark functionality confirmed to
> work in 4.7 against HBase 1.2?
>
>
> On Tue, Aug 9, 2016 at 7:37 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
>
>> Hi Nathan,
>>
>> That's a new error to me. I've heard some people have had some luck
>> passing the phoenix-spark and phoenix-client JAR using the --jars option,
>> but the recommended procedure is to ensure you're using the
>> *phoenix-client-spark* JAR on the Spark driver and executor classpath
>> from config. [1]
>>
>> As a reference, here's a Docker image with a working configuration as
>> well [2]
>>
>> Good luck,
>>
>> Josh
>>
>> [1] https://phoenix.apache.org/phoenix_spark.html
>> [2] https://github.com/jmahonin/docker-phoenix/tree/phoenix_spark
>>
>> On Tue, Aug 9, 2016 at 2:20 PM, Nathan Davis <nathan.da...@salesforce.com
>> > wrote:
>>
>>> I am trying to create a simple POC of the Spark / Phoenix integration.
>>> The operation is:
>>>
>>> val df = sqlContext.load("org.apache.phoenix.spark", Map("table" ->
>>>> "SIMPLE_TABLE", "zkUrl" -> "some-name:2181"))
>>>
>>>
>>> The error I get from that is:
>>>
>>> org.apache.phoenix.schema.TableNotFoundException: ERROR 1012 (42M03):
>>>> Table undefined. tableName=SYSTEM.CATALOG
>>>
>>> at org.apache.phoenix.query.ConnectionQueryServicesImpl.getAllT
>>>> ableRegions(ConnectionQueryServicesImpl.java:542)
>>>
>>> at org.apache.phoenix.query.ConnectionQueryServicesImpl.checkCl
>>>> ientServerCompatibility(ConnectionQueryServicesImpl.java:1113)
>>>
>>> at org.apache.phoenix.query.ConnectionQueryServicesImpl.ensureT
>>>> ableCreated(ConnectionQueryServicesImpl.java:1033)
>>>
>>> at org.apache.phoenix.query.ConnectionQueryServicesImpl.createT
>>>> able(ConnectionQueryServicesImpl.java:1369)
>>>
>>> at org.apache.phoenix.query.DelegateConnectionQueryServices.cre
>>>> ateTable(DelegateConnectionQueryServices.java:120)
>>>
>>> at org.apache.phoenix.schema.MetaDataClient.createTableInternal
>>>> (MetaDataClient.java:2116)
>>>
>>> at org.apache.phoenix.schema.MetaDataClient.createTable(MetaDat
>>>> aClient.java:828)
>>>
>>> at org.apache.phoenix.compile.CreateTableCompiler$2.execute(Cre
>>>> ateTableCompiler.java:183)
>>>
>>> at org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixState
>>>> ment.java:338)
>>>
>>> at org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixState
>>>> ment.java:326)
>>>
>>> at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
>>>
>>> at org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(Pho
>>>> enixStatement.java:324)
>>>
>>> at org.apache.phoenix.jdbc.PhoenixStatement.executeUpdate(Phoen
>>>> ixStatement.java:1326)
>>>
>>> at org.apache.phoenix.query.ConnectionQueryServicesImpl$13.call
>>>> (ConnectionQueryServicesImpl.java:2279)
>>>
>>> at org.apache.phoenix.query.ConnectionQueryServicesImpl$13.call
>>>> (ConnectionQueryServicesImpl.java:2248)
>>>
>>> at org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixC
>>>> ontextExecutor.java:78)
>>>
>>> at org.apache.phoenix.query.ConnectionQueryServicesImpl.init(Co
>>>> nnectionQueryServicesImpl.java:2248)
>>>
>>> at org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServ
>>>> ices(PhoenixDriver.java:233)
>>>
>>> at org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.createConnecti
>>>> on(PhoenixEmbeddedDriver.java:135)
>>>
>>> at org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:202)
>>>
>>> at java.sql.DriverManager.getConnection(DriverManager.java:664)
>>>
>>> at java.sql.DriverManager.getConnection(DriverManager.java:208)
>>>
>>> at org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnecti
>>>> on(ConnectionUtil.java:98)
>>>
>>>
>>> This is in a spark-shell session started with command:
>>>
>>> spark-shell --packages com.databricks:spark-csv_2.10:1.4.0 --jars
>>>> /root/jars/phoenix-spark-4.7.0-HBase-1.2.jar,/root/jars/phoe
>>>> nix-4.7.0-HBase-1.2-client.jar
>>>
>>>
>>>
>>> Using both sqlline.py and hbase shell I can see that SYSTEM.CATALOG
>>> clearly exists and has the table metadata I'd expect.
>>>
>>> What am I doing wrong here?
>>>
>>> Thanks,
>>> -nathan
>>>
>>>
>>>
>>
>


Re: Phoenix spark and dynamic columns

2016-07-27 Thread Josh Mahonin
Hi Paul,

Unfortunately out of the box the Spark integration doesn't support saving
to dynamic columns. It's worth filing a JIRA enhancement over, and if
you're interested in contributing a patch, here's the following spots I
think would need enhancing:

The saving code derives the column names to use with Phoenix from the
DataFrame itself here [1] as `fieldArray`. We would likely need a new
DataFrame parameter to pass in the column list (with dynamic columns
included) here [2]

[1]
https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DataFrameFunctions.scala#L32-L35
[2]
https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DefaultSource.scala#L38

The output configuration, which takes care of getting the MapReduce bits
ready for saving, would also need to be updated to support the dynamic
column definitions here [3], and then the 'UPSERT' statement construction
would need to be adjusted to support those as well here [4]

[3]
https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/ConfigurationUtil.scala#L25-L38
[4]
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/util/PhoenixConfigurationUtil.java#L259
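
In the meantime, one possible workaround (not part of the phoenix-spark
integration; a sketch only, using the table/columns from your example and an
invented quorum) is to upsert the dynamic columns yourself over plain JDBC
from within the Spark job:

import java.sql.DriverManager

df.foreachPartition { rows =>
  val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")
  conn.setAutoCommit(false)
  // Dynamic columns are declared inline in the UPSERT column list
  val stmt = conn.prepareStatement(
    "UPSERT INTO MYTABLE (\"KEY\", CAT1 VARCHAR, CAT2 VARCHAR) VALUES (?, ?, ?)")
  rows.foreach { row =>
    stmt.setString(1, row.getString(0))
    stmt.setString(2, row.getString(1))
    stmt.setString(3, row.getString(2))
    stmt.executeUpdate()
  }
  conn.commit()
  conn.close()
}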

Thanks,

Josh


On Mon, Jul 25, 2016 at 5:49 PM, Paul Jones  wrote:

> Is it possible to save a dataframe into a table where the columns are
> dynamic?
>
> For instance, I have a loaded a CSV file with header (key, cat1, cat2)
> into a dataframe. All values are strings. I created a table like this:
> create table mytable ("KEY" varchar not null primary key); The code is as
> follows:
>
> val df = sqlContext.read
> .format("com.databricks.spark.csv")
> .option("header", "true")
> .option("inferSchema", "true")
> .option("delimiter", "\t")
> .load("saint.tsv")
>
> df.write
> .format("org.apache.phoenix.spark")
> .mode("overwrite")
> .option("table", "mytable")
> .option("zkUrl", "servier:2181/hbase")
> .save()
>
> The CSV files I process always have a key column but I don’t know what the
> other columns will be until I start processing. The code above fails my
> example unless I create static columns named cat1 and cat2. Can I change
> the save somehow to run an upsert specifying the names/column types thus
> saving into dynamic columns?
>
> Thanks in advance,
> Paul
>
>


Re: NoClassDefFoundError org/apache/hadoop/hbase/HBaseConfiguration

2016-07-06 Thread Josh Mahonin
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
> at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166)
> at
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
> ...
> ... 73 more
> ...
> :16: error: not found: value sqlContext
>  import sqlContext.implicits._
> ^
> :16: error: not found: value sqlContext
>  import sqlContext.sql
>
>
>
>
> On 7/5/16, Josh Mahonin <jmaho...@gmail.com> wrote:
> > Hi Robert,
> >
> > I recommend following up with HDP on this issue.
> >
> > The underlying problem is that the 'phoenix-spark-4.4.0.2.4.0.0-169.jar'
> > they've provided isn't actually a fat client JAR, it's missing many of
> the
> > required dependencies. They might be able to provide the correct JAR for
> > you, but you'd have to check with them. It may also be possible for you
> to
> > manually include all of the necessary JARs on the Spark classpath to
> mimic
> > the fat jar, but that's fairly ugly and time consuming.
> >
> > FWIW, the HDP 2.5 Tech Preview seems to include the correct JAR, though I
> > haven't personally tested it out yet.
> >
> > Good luck,
> >
> > Josh
> >
> > On Tue, Jul 5, 2016 at 2:00 AM, Robert James <srobertja...@gmail.com>
> > wrote:
> >
> >> I'm trying to use Phoenix on Spark, and can't get around this error:
> >>
> >> java.lang.NoClassDefFoundError:
> >> org/apache/hadoop/hbase/HBaseConfiguration
> >> at
> >>
> org.apache.phoenix.spark.PhoenixRDD.getPhoenixConfiguration(PhoenixRDD.scala:82)
> >>
> >> DETAILS:
> >> 1. I'm running HDP 2.4.0.0-169
> >> 2. Using phoenix-sqlline, I can access Phoenix perfectly
> >> 3. Using hbase shell, I can access HBase perfectly
> >> 4. I added the following lines to /etc/spark/conf/spark-defaults.conf
> >>
> >> spark.driver.extraClassPath
> >> /usr/hdp/current/phoenix-client/lib/phoenix-spark-4.4.0.2.4.0.0-169.jar
> >> spark.executor.extraClassPath
> >> /usr/hdp/current/phoenix-client/lib/phoenix-spark-4.4.0.2.4.0.0-169.jar
> >>
> >> 5. Steps to reproduce the error:
> >> # spark-shell
> >> ...
> >> scala> import org.apache.phoenix.spark._
> >> import org.apache.phoenix.spark._
> >>
> >> scala> sqlContext.load("org.apache.phoenix.spark", Map("table" ->
> >> "EMAIL_ENRON", "zkUrl" -> "localhost:2181"))
> >> warning: there were 1 deprecation warning(s); re-run with -deprecation
> >> for details
> >> java.lang.NoClassDefFoundError:
> >> org/apache/hadoop/hbase/HBaseConfiguration
> >> at
> >>
> org.apache.phoenix.spark.PhoenixRDD.getPhoenixConfiguration(PhoenixRDD.scala:82)
> >>
> >> // Or, this gets the same error
> >> scala> val rdd = sc.phoenixTableAsRDD("EMAIL_ENRON", Seq("MAIL_FROM",
> >> "MAIL_TO"), zkUrl=Some("localhost"))
> >> java.lang.NoClassDefFoundError:
> >> org/apache/hadoop/hbase/HBaseConfiguration
> >> at
> >>
> org.apache.phoenix.spark.PhoenixRDD.getPhoenixConfiguration(PhoenixRDD.scala:82)
> >> at
> >>
> org.apache.phoenix.spark.PhoenixRDD.phoenixConf$lzycompute(PhoenixRDD.scala:38)
> >>
> >> 6. I've tried every permutation I can think of, and also spent hours
> >> Googling.  Some times I can get different errors, but always errors.
> >> Interestingly, if I manage to load the HBaseConfiguration class
> >> manually (by specifying classpaths and then import), I get a
> >> "phoenixTableAsRDD is not a member of SparkContext" error.
> >>
> >> How can I use Phoenix from within Spark?  I'm really eager to do so,
> >> but haven't been able to.
> >>
> >> Also: Can someone give me some background on the underlying issues
> >> here? Trial-and-error-plus-google is not exactly high quality
> >> engineering; I'd like to understand the problem better.
> >>
> >
>


Re: Phoenix-Spark: is DataFrame saving a single threaded operation?

2016-07-05 Thread Josh Mahonin
Hi Vamsi,

The DataFrame has an underlying number of partitions associated with it,
which will be processed by however many workers you have in your Spark
cluster.

You can check the number of partitions with:
df.rdd.partitions.size

And you can alter the partitions using:
df.repartition(numPartitions)

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame

Good luck,

Josh

On Tue, Jul 5, 2016 at 12:01 PM, Vamsi Krishna 
wrote:

> Team,
>
> In Phoenix-Spark plugin is DataFrame save operation single threaded?
>
> df.write \
>   .format("org.apache.phoenix.spark") \
>   .mode("overwrite") \
>   .option("table", "TABLE1") \
>   .option("zkUrl", "localhost:2181") \
>   .save()
>
>
> Thanks,
> Vamsi Attluri
> --
> Vamsi Attluri
>


Re: NoClassDefFoundError org/apache/hadoop/hbase/HBaseConfiguration

2016-07-05 Thread Josh Mahonin
Hi Robert,

I recommend following up with HDP on this issue.

The underlying problem is that the 'phoenix-spark-4.4.0.2.4.0.0-169.jar'
they've provided isn't actually a fat client JAR, it's missing many of the
required dependencies. They might be able to provide the correct JAR for
you, but you'd have to check with them. It may also be possible for you to
manually include all of the necessary JARs on the Spark classpath to mimic
the fat jar, but that's fairly ugly and time consuming.

FWIW, the HDP 2.5 Tech Preview seems to include the correct JAR, though I
haven't personally tested it out yet.
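
If you do build the fat JAR yourself, the classpath entries in
spark-defaults.conf would look roughly like the following (the path and exact
file name are placeholders; use whatever your build produces):

spark.driver.extraClassPath /path/to/phoenix-<version>-client-spark.jar
spark.executor.extraClassPath /path/to/phoenix-<version>-client-spark.jar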

Good luck,

Josh

On Tue, Jul 5, 2016 at 2:00 AM, Robert James  wrote:

> I'm trying to use Phoenix on Spark, and can't get around this error:
>
> java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
> at
> org.apache.phoenix.spark.PhoenixRDD.getPhoenixConfiguration(PhoenixRDD.scala:82)
>
> DETAILS:
> 1. I'm running HDP 2.4.0.0-169
> 2. Using phoenix-sqlline, I can access Phoenix perfectly
> 3. Using hbase shell, I can access HBase perfectly
> 4. I added the following lines to /etc/spark/conf/spark-defaults.conf
>
> spark.driver.extraClassPath
> /usr/hdp/current/phoenix-client/lib/phoenix-spark-4.4.0.2.4.0.0-169.jar
> spark.executor.extraClassPath
> /usr/hdp/current/phoenix-client/lib/phoenix-spark-4.4.0.2.4.0.0-169.jar
>
> 5. Steps to reproduce the error:
> # spark-shell
> ...
> scala> import org.apache.phoenix.spark._
> import org.apache.phoenix.spark._
>
> scala> sqlContext.load("org.apache.phoenix.spark", Map("table" ->
> "EMAIL_ENRON", "zkUrl" -> "localhost:2181"))
> warning: there were 1 deprecation warning(s); re-run with -deprecation
> for details
> java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
> at
> org.apache.phoenix.spark.PhoenixRDD.getPhoenixConfiguration(PhoenixRDD.scala:82)
>
> // Or, this gets the same error
> scala> val rdd = sc.phoenixTableAsRDD("EMAIL_ENRON", Seq("MAIL_FROM",
> "MAIL_TO"), zkUrl=Some("localhost"))
> java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
> at
> org.apache.phoenix.spark.PhoenixRDD.getPhoenixConfiguration(PhoenixRDD.scala:82)
> at
> org.apache.phoenix.spark.PhoenixRDD.phoenixConf$lzycompute(PhoenixRDD.scala:38)
>
> 6. I've tried every permutation I can think of, and also spent hours
> Googling.  Some times I can get different errors, but always errors.
> Interestingly, if I manage to load the HBaseConfiguration class
> manually (by specifying classpaths and then import), I get a
> "phoenixTableAsRDD is not a member of SparkContext" error.
>
> How can I use Phoenix from within Spark?  I'm really eager to do so,
> but haven't been able to.
>
> Also: Can someone give me some background on the underlying issues
> here? Trial-and-error-plus-google is not exactly high quality
> engineering; I'd like to understand the problem better.
>


Re: phoenix spark options not supporting query in dbtable

2016-06-09 Thread Josh Mahonin
Hi Xindian,

The phoenix-spark integration is based on the Phoenix MapReduce layer,
which doesn't support aggregate functions. However, as you mentioned, both
filtering and pruning predicates are pushed down to Phoenix. With an RDD or
DataFrame loaded, all of Spark's various aggregation methods are available
to you.

Although the Spark JDBC data source supports the full complement of
Phoenix's supported queries, the way it achieves parallelism is to split
the query across a number of workers and connections based on a
'partitionColumn' with a 'lowerBound' and 'upperBound', which must be
numeric. If your use case has numeric primary keys, then that is
potentially a good solution for you. [1]

The phoenix-spark parallelism is based on the splits provided by the
Phoenix query planner, and has no requirements on specifying partition
columns or upper/lower bounds. It's up to you to evaluate which technique
is the right method for your use case. [2]

Good luck,

Josh

[1]
http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
[2] https://phoenix.apache.org/phoenix_spark.html
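
To illustrate the difference (a sketch only; table, column, quorum and bound
values below are all invented):

// 1) Spark JDBC source: the full query is pushed down, and parallelism
//    comes from splitting on a numeric partition column.
val jdbcDf = sqlContext.read.format("jdbc").options(Map(
  "driver" -> "org.apache.phoenix.jdbc.PhoenixDriver",
  "url" -> "jdbc:phoenix:zk-host:2181",
  "dbtable" -> "(SELECT ID, VAL FROM MY_TABLE WHERE VAL > 10) AS t",
  "partitionColumn" -> "ID",
  "lowerBound" -> "0",
  "upperBound" -> "1000000",
  "numPartitions" -> "16")).load()

// 2) phoenix-spark: splits come from the Phoenix query planner; the filter
//    and column pruning are pushed down, aggregates run in Spark afterwards.
val phoenixDf = sqlContext.read.format("org.apache.phoenix.spark")
  .option("table", "MY_TABLE")
  .option("zkUrl", "zk-host:2181")
  .load()
  .filter("VAL > 10")
  .select("ID", "VAL")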


On Wed, Jun 8, 2016 at 6:01 PM, Long, Xindian 
wrote:

> The Spark JDBC data source supports to specify a query as the  “dbtable”
> option.
>
> I assume all queries in the above query in pushed down to the database
> instead of done in Spark.
>
>
>
> The  phoenix spark plug in seems not supporting that. Why is that? Any
> plan in the future to support it?
>
>
>
> I know phoenix spark does support an optional select clause and predicate
> push down in some cases, but it is limited.
>
>
>
> Thanks
>
>
>
> Xindian
>
>
>
>
>
> ---
>
> Xindian “Shindian” Long
>
> Mobile:  919-9168651
>
> Email: xindian.l...@gmail.com
>
>
>
>
>
>
>


Re: GenericMutableRow cannot be cast to org.apache.spark.sql.Row

2016-05-15 Thread Josh Mahonin
Hi Radha,

I suggest you create a ticket with Hortonworks for this issue.

I believe the root cause is that the version of Phoenix they have provided
doesn't include all of the necessary patches for Spark 1.6 DataFrame
support.

Good luck,

Josh

On Thu, May 12, 2016 at 3:11 AM, Radha krishna  wrote:

> Hi All,
>
> I am using spark + phoenix combination, after loading the data(using and
> spark+phoenix) I tried to perform some join operations and it is giving the
> below error message. can some one suggest what is the solution for this
> problem
>
> Hadoop Distribution : Hortonworks
> Spark Version : 1.6
> Hbase Version: 1.1.2
> Phoenix Version: 4.4.0
>
> Error
> 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 21
> in stage 0.0 failed 1 times, most recent failure: Lost task 21.0 in stage
> 0.0 (TID 21, localhost): java.lang.ClassCastException:
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast
> to org.apache.spark.sql.Row
> at
> org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
> at scala.Option.foreach(Option.scala:236)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
> at
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:212)
> at
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
> at
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
> at
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1538)
> at
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1538)
> at
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
> at
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2125)
> at org.apache.spark.sql.DataFrame.org
> $apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1537)
> at org.apache.spark.sql.DataFrame.org
> $apache$spark$sql$DataFrame$$collect(DataFrame.scala:1544)
> at
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1414)
> at
> 

Re: Phoenix-Spark: Number of partitions in PhoenixRDD

2016-04-18 Thread Josh Mahonin
Hi Diego,

The phoenix-spark RDD partition count is equal to the number of splits that
the query planner returns. Adjusting the HBase region splits, table salting
[1], as well as the guidepost width [2] should help with the
parallelization here.

Using 'EXPLAIN' for the generated query in sqlline might be helpful for
debugging here as well.

It would be great if you could update this thread with any lessons learned
as well.

Good luck,

Josh

[1] https://phoenix.apache.org/salted.html
[2] https://phoenix.apache.org/update_statistics.html
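
As a quick way to see the effect (table/column names and quorum are invented):
the RDD's partition count mirrors the planner's splits, so it should grow as
the guidepost width is reduced (and statistics re-gathered) or as salt buckets
and region splits are added.

import org.apache.phoenix.spark._

val rdd = sc.phoenixTableAsRDD("BIG_TABLE", Seq("ID", "VAL"),
  zkUrl = Some("zk-host:2181"))
// Number of splits the Phoenix query planner produced for this scan
println(rdd.partitions.size)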

On Mon, Apr 18, 2016 at 4:37 AM, Fustes, Diego 
wrote:

> Hi all,
>
>
>
> I'm working with the Phoenix spark plugin to process a HUGE table. The
> table is salted in 100 buckets and is split in 400 regions. When I read it
> with phoenixTableAsRDD, I get a RDD with 150 parititions. These partitions
> are too big, such
>
> that I am getting OutOfMemory problems. Therefore, I would like to get
> smaller partitions. To do this, I could just call repartition, but it would
> shuffle the whole dataset... So, my question is, is there a way to modify
> PhoenixInputFormat
>
> to get more partitions in the resulting RDD?
>
>
>
> Thanks and regards,
>
>
>
> Diego
>
>
>
>
>
>
>
>
>
>
>
>
> *NDT GDAC Spain S.L.*
>
> Diego Fustes, Big Data and Machine Learning Expert
>
> Gran Vía de les Corts Catalanes 130, 11th floor
>
> 08038 Barcelona, Spain
>
> Phone: +34 93 43 255 27
>
> diego.fus...@ndt-global.com
>
> *www.ndt-global.com *
>
>
>
> --
> This email is intended only for the recipient(s) designated above.  Any 
> dissemination, distribution, copying, or use of the information contained 
> herein by anyone other than the recipient(s) designated by the sender is 
> unauthorized and strictly prohibited and subject to legal privilege.  If you 
> have received this e-mail in error, please notify the sender immediately and 
> delete and destroy this email.
>
> Der Inhalt dieser E-Mail und deren Anhänge sind vertraulich. Wenn Sie nicht 
> der Adressat sind, informieren Sie bitte den Absender unverzüglich, verwenden 
> Sie den Inhalt nicht und löschen Sie die E-Mail sofort.
>
> NDT Global GmbH and Co. KG,  Friedrich-List-Str. 1, D-76297 Stutensee, Germany
> Registry Court Mannheim
> HRA 704288
>
> Personally liable partner:
> NDT Verwaltungs GmbH
> Friedrich-List-Straße 1, D-76297 Stutensee, Germany
> Registry Court Mannheim
> HRB 714639
> CEO: Gunther Blitz
>
>
>
>
>
>


Re: Spark & Phoenix data load

2016-04-10 Thread Josh Mahonin
Hi Neelesh,

The saveToPhoenix method uses the MapReduce PhoenixOutputFormat under the
hood, which is a wrapper over the JDBC driver. It's likely not as efficient
as the CSVBulkLoader, although there are performance improvements over a
simple JDBC client as the writes are spread across multiple Spark workers
(depending on the number of partitions in the RDD/DataFrame).
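
For reference, a minimal sketch of that path (table/column names and quorum
are made up); each partition gets its own PhoenixRecordWriter and commits
batched upserts:

import org.apache.phoenix.spark._

sc.parallelize(Seq((1L, "a"), (2L, "b")), 8) // 8 partitions => 8 parallel writers
  .saveToPhoenix("OUTPUT_TABLE", Seq("ID", "COL1"), zkUrl = Some("zk-host:2181"))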

Regards,

Josh

On Sun, Apr 10, 2016 at 1:21 AM, Neelesh  wrote:

> Hi ,
>   Does phoenix-spark's saveToPhoenix use the JDBC driver internally, or
> does it do something similar to CSVBulkLoader using HFiles?
>
> Thanks!
>
>


Re: [HELP:]Save Spark Dataframe in Phoenix Table

2016-04-10 Thread Josh Mahonin
Hi Divya,

No, there is a separate JAR that would look like
'phoenix-4.4.0.XXX-client-spark.jar'. If you download a binary release of
Phoenix, or compile the latest version yourself, you will be able to see
and use it. It does not come with the HDP 2.3.4 platform, at least last I
checked.

Regards,

Josh

On Sat, Apr 9, 2016 at 2:24 PM, Divya Gehlot <divya.htco...@gmail.com>
wrote:

> Hi Josh,
> Thank you very much for your help.
> I could see there is  phoenix-spark-4.4.0.2.3.4.0-3485.jar in my
> phoenix/lib.
> Please confirm is the above jar you are talking about?
>
> Thanks,
> Divya
>
> Josh Mahonin <jmahonin@
>
> On 9 April 2016 at 23:01, Josh Mahonin <jmaho...@gmail.com> wrote:
>
>> Hi Divya,
>>
>> You don't have the phoenix client-spark JAR in your classpath, which is
>> required for the phoenix-spark integration to work (as per the
>> documentation).
>>
>> As well, you aren't using the vanilla Apache project that this mailing
>> list supports, but are using a vendor packaged platform (Hortonworks).
>> Since they maintain their own patches and forks to the upstream Apache
>> versions, in general you should opt for filing support tickets with them
>> first. In this particular case, HDP 2.3.4 doesn't actually provide the
>> necessary phoenix client-spark JAR by default, so your options are limited
>> here. Again, I recommend filing a support ticket with Hortonworks.
>>
>> Regards,
>>
>> Josh
>>
>> On Sat, Apr 9, 2016 at 9:11 AM, Divya Gehlot <divya.htco...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> The code which I using to connect to Phoenix for writing
>>> def writeToTable(df: DataFrame,dbtable: String) = {
>>> val phx_properties = collection.immutable.Map[String, String](
>>>  "zkUrl" -> "localhost:2181:/hbase-unsecure",
>>> "table" -> dbtable)
>>>
>>> df.write.format("org.apache.phoenix.spark").mode(SaveMode.Overwrite).options(phx_properties).saveAsTable(dbtable)
>>> }
>>>
>>> While Submitting Spark job
>>> * spark-shell  --properties-file  /TestDivya/Spark/Phoenix.properties
>>> --jars
>>> /usr/hdp/2.3.4.0-3485/hive/lib/hive-hbase-handler-1.2.1.2.3.4.0-3485.jar,/usr/hdp/2.3.4.0-3485/phoenix/lib/zookeeper.jar,/usr/hdp/2.3.4.0-3485/phoenix/lib/hbase-client.jar,/usr/hdp/2.3.4.0-3485/phoenix/lib/hbase-common.jar,/usr/hdp/2.3.4.0-3485/phoenix/lib/hbase-protocol.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/phoenix-server.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/phoenix-client-4.4.0.jar
>>>  --driver-class-path
>>> /usr/hdp/2.3.4.0-3485/hbase/lib/phoenix-server.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/phoenix-client-4.4.0.jar
>>> --packages com.databricks:spark-csv_2.10:1.4.0  --master yarn-client  -i
>>> /TestDivya/Spark/WriteToPheonix.scala*
>>>
>>>
>>> Getting the below error :
>>>
>>> org.apache.spark.sql.AnalysisException: 
>>> org.apache.phoenix.spark.DefaultSource
>>> does not allow user-specified schemas.;
>>>
>>> Am I on the right track or missing any properties ?
>>>
>>>  Because of this I am unable to proceed with Phoenix and have to find
>>> alternate options.
>>> Would really appreciate the help
>>>
>>>
>>>
>>>
>>>
>>> -- Forwarded message --
>>> From: Divya Gehlot <divya.htco...@gmail.com>
>>> Date: 8 April 2016 at 19:54
>>> Subject: Re: [HELP:]Save Spark Dataframe in Phoenix Table
>>> To: Josh Mahonin <jmaho...@gmail.com>
>>>
>>>
>>> Hi Josh,
>>> I am doing in the same manner as mentioned in Phoenix Spark manner.
>>> Using the latest version of HDP 2.3.4 .
>>> In case of version mismatch/lack of spark Phoenix support it's should
>>> have thrown the error at read also.
>>> Which is working fine as expected .
>>> Will surely pass on the code snippets once I log on to my System.
>>> In the mean while I would like to know the zkURL parameter.If I build it
>>> with HbaseConfiguration and passing zk quorom ,znode and port .
>>> It throws error for example localhost :2181/hbase-unsecure
>>> This localhost gets replaced by all the quorom
>>> Like quorum1,quorum2:2181/hbase-unsecure
>>>
>>> I am just providing the IP address of my HBase master.
>>>
>>> I feel like I am  not on right track so asked for the help .
>>> How to connect to Phoenix through Spark on hadoop cluster .
>>> Thanks for the help.
>>> Cheers,

Re: Spark Plugin Information

2016-04-10 Thread Josh Mahonin
Hi Ben,

I have no way of verifying myself, but if that is the same URL you use for
Zookeeper quorum, and you've verified that you can connect using the
regular JDBC client, I would expect that to work for you as well.

Perhaps start off smaller with a 1-node test cluster, verify your
configuration, then move up from there?
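
One quick way to check that outside of Spark (the hosts below are
placeholders): if this plain JDBC connection succeeds, the same quorum string
should be usable as the 'zkUrl' option.

import java.sql.DriverManager

val quorum = "zk1.example.com,zk2.example.com,zk3.example.com:2181"
val conn = DriverManager.getConnection(s"jdbc:phoenix:$quorum")
println("Connected: " + !conn.isClosed)
conn.close()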

Josh

On Sat, Apr 9, 2016 at 3:09 PM, Benjamin Kim <bbuil...@gmail.com> wrote:

> Josh,
>
> For my tests, I’m passing the Zookeeper Quorum URL.
>
> "zkUrl" -> "prod-nj3-hbase-master001.pnj3i.gradientx.com,
> prod-nj3-namenode001.pnj3i.gradientx.com,
> prod-nj3-namenode002.pnj3i.gradientx.com:2181”
>
> Is this correct?
>
> Thanks,
> Ben
>
>
> On Apr 9, 2016, at 8:06 AM, Josh Mahonin <jmaho...@gmail.com> wrote:
>
> Hi Ben,
>
> It looks like a connection URL issue. Are you passing the correct 'zkUrl'
> parameter, or do you have the HBase Zookeeper quorum defined in an
> hbase-site.xml available in the classpath?
>
> If you're able to connect to Phoenix using JDBC, you should be able to
> take the JDBC url, pop off the 'jdbc:phoenix:' prefix and use it as the
> 'zkUrl' option.
>
> Josh
>
> On Fri, Apr 8, 2016 at 6:47 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> Hi Josh,
>>
>> I am using CDH 5.5.2 with HBase 1.0.0, Phoenix 4.5.2, and Spark 1.6.0. I
>> looked up the error and found others who led me to ask the question. I’ll
>> try to use Phoenix 4.7.0 client jar and see what happens.
>>
>> The error I am getting is:
>>
>> java.sql.SQLException: ERROR 103 (08004): Unable to establish connection.
>> at
>> org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:388)
>> at
>> org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:145)
>> at
>> org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection(ConnectionQueryServicesImpl.java:296)
>> at
>> org.apache.phoenix.query.ConnectionQueryServicesImpl.access$300(ConnectionQueryServicesImpl.java:179)
>> at
>> org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:1917)
>> at
>> org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:1896)
>> at
>> org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:77)
>> at
>> org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:1896)
>> at
>> org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:180)
>> at
>> org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.connect(PhoenixEmbeddedDriver.java:132)
>> at org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:151)
>> at java.sql.DriverManager.getConnection(DriverManager.java:571)
>> at java.sql.DriverManager.getConnection(DriverManager.java:187)
>> at
>> org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(ConnectionUtil.java:93)
>> at
>> org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:57)
>> at
>> org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:45)
>> at
>> org.apache.phoenix.mapreduce.util.PhoenixConfigurationUtil.getSelectColumnMetadataList(PhoenixConfigurationUtil.java:280)
>> at org.apache.phoenix.spark.PhoenixRDD.toDataFrame(PhoenixRDD.scala:101)
>> at
>> org.apache.phoenix.spark.PhoenixRelation.schema(PhoenixRelation.scala:57)
>> at
>> org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37)
>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>> at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1153)
>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:30)
>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:40)
>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:42)
>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:44)
>> at $iwC$$iwC$$iwC$$iwC$$iwC.(:46)
>> at $iwC$$iwC$$iwC$$iwC.(:48)
>> at $iwC$$iwC$$iwC.(:50)
>> at $iwC$$iwC.(:52)
>> at $iwC.(:54)
>> at (:56)
>> at .(:60)
>> at .()
>> at .(:7)
>> at .()
>> at $print()
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at
>> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(S

Re: Spark Plugin Information

2016-04-09 Thread Josh Mahonin
> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
> at org.apache.spark.repl.SparkILoop.org
> $apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at org.apache.spark.repl.SparkILoop.org
> $apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.io.IOException: java.lang.reflect.InvocationTargetException
> at
> org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:240)
> at
> org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:414)
> at
> org.apache.hadoop.hbase.client.ConnectionManager.createConnectionInternal(ConnectionManager.java:323)
> at
> org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:144)
> at
> org.apache.phoenix.query.HConnectionFactory$HConnectionFactoryImpl.createConnection(HConnectionFactory.java:47)
> at
> org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection(ConnectionQueryServicesImpl.java:294)
> ... 73 more
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at
> org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:238)
> ... 78 more
> Caused by: java.lang.UnsupportedOperationException: Unable to find
> org.apache.hadoop.hbase.ipc.controller.ClientRpcControllerFactory
> at
> org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:36)
> at
> org.apache.hadoop.hbase.ipc.RpcControllerFactory.instantiate(RpcControllerFactory.java:58)
> at
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.createAsyncProcess(ConnectionManager.java:2317)
> at
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.(ConnectionManager.java:688)
> at
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.(ConnectionManager.java:630)
> ... 83 more
> Caused by: java.lang.ClassNotFoundException:
> org.apache.hadoop.hbase.ipc.controller.ClientRpcControllerFactory
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:190)
> at
> org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:32)
> ... 87 more
>
> Thanks,
> Ben
>
>
> On Apr 8, 2016, at 2:23 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
>
> Hi Ben,
>
> If you have a reproducible test case, please file a JIRA for it. The
> documentation (https://phoenix.apache.org/phoenix_spark.html) is accurate
> and verified for up to Phoenix 4.7.0 and Spark 1.6.0.
>
> Although not supported by the Phoenix project at large, you may f

Re: [HELP:]Save Spark Dataframe in Phoenix Table

2016-04-09 Thread Josh Mahonin
Hi Divya,

You don't have the phoenix client-spark JAR in your classpath, which is
required for the phoenix-spark integration to work (as per the
documentation).

As well, you aren't using the vanilla Apache project that this mailing list
supports, but are using a vendor packaged platform (Hortonworks). Since
they maintain their own patches and forks to the upstream Apache versions,
in general you should opt for filing support tickets with them first. In
this particular case, HDP 2.3.4 doesn't actually provide the necessary
phoenix client-spark JAR by default, so your options are limited here.
Again, I recommend filing a support ticket with Hortonworks.

Regards,

Josh
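
For reference, once a phoenix-client-spark JAR is available, the usual
wiring is just two classpath entries in spark-defaults.conf. The path below
is only an example of where such a JAR might live:

spark.driver.extraClassPath   /usr/hdp/current/phoenix-client/phoenix-client-spark.jar
spark.executor.extraClassPath /usr/hdp/current/phoenix-client/phoenix-client-spark.jar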

On Sat, Apr 9, 2016 at 9:11 AM, Divya Gehlot <divya.htco...@gmail.com>
wrote:

> Hi,
> The code which I am using to connect to Phoenix for writing:
> def writeToTable(df: DataFrame,dbtable: String) = {
> val phx_properties = collection.immutable.Map[String, String](
>  "zkUrl" -> "localhost:2181:/hbase-unsecure",
> "table" -> dbtable)
>
> df.write.format("org.apache.phoenix.spark").mode(SaveMode.Overwrite).options(phx_properties).saveAsTable(dbtable)
> }
>
> While Submitting Spark job
> * spark-shell  --properties-file  /TestDivya/Spark/Phoenix.properties
> --jars
> /usr/hdp/2.3.4.0-3485/hive/lib/hive-hbase-handler-1.2.1.2.3.4.0-3485.jar,/usr/hdp/2.3.4.0-3485/phoenix/lib/zookeeper.jar,/usr/hdp/2.3.4.0-3485/phoenix/lib/hbase-client.jar,/usr/hdp/2.3.4.0-3485/phoenix/lib/hbase-common.jar,/usr/hdp/2.3.4.0-3485/phoenix/lib/hbase-protocol.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/phoenix-server.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/phoenix-client-4.4.0.jar
>  --driver-class-path
> /usr/hdp/2.3.4.0-3485/hbase/lib/phoenix-server.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/phoenix-client-4.4.0.jar
> --packages com.databricks:spark-csv_2.10:1.4.0  --master yarn-client  -i
> /TestDivya/Spark/WriteToPheonix.scala*
>
>
> Getting the below error :
>
> org.apache.spark.sql.AnalysisException: org.apache.phoenix.spark.DefaultSource
> does not allow user-specified schemas.;
>
> Am I on the right track or missing any properties ?
>
>  Because of this I am unable to proceed with Phoenix and have to find
> alternate options.
> Would really appreciate the help
>
>
>
>
>
> -- Forwarded message --
> From: Divya Gehlot <divya.htco...@gmail.com>
> Date: 8 April 2016 at 19:54
> Subject: Re: [HELP:]Save Spark Dataframe in Phoenix Table
> To: Josh Mahonin <jmaho...@gmail.com>
>
>
> Hi Josh,
> I am doing it in the same manner as described on the Phoenix Spark page,
> using the latest version of HDP 2.3.4.
> In case of a version mismatch or missing Spark support in Phoenix, it
> should have thrown the error on read as well,
> but reads are working fine as expected.
> I will pass on the code snippets once I log on to my system.
> In the meanwhile, I would like to ask about the zkUrl parameter. If I build it
> from HBaseConfiguration, passing the ZK quorum, znode and port,
> it throws an error: for example, with localhost:2181/hbase-unsecure
> the localhost part gets replaced by the whole quorum,
> like quorum1,quorum2:2181/hbase-unsecure.
>
> I am just providing the IP address of my HBase master.
>
> I feel like I am not on the right track, so I asked for help.
> How do I connect to Phoenix through Spark on a Hadoop cluster?
> Thanks for the help.
> Cheers,
> Divya
> On Apr 8, 2016 7:06 PM, "Josh Mahonin" <jmaho...@gmail.com> wrote:
>
>> Hi Divya,
>>
>> That's strange. Are you able to post a snippet of your code to look at?
>> And are you sure that you're saving the dataframes as per the docs (
>> https://phoenix.apache.org/phoenix_spark.html)?
>>
>> Depending on your HDP version, it may or may not actually have
>> phoenix-spark support. Double-check that your Spark configuration is set up
>> with the right worker/driver classpath settings, and that the phoenix JARs
>> contain the necessary phoenix-spark classes
>> (e.g. org.apache.phoenix.spark.PhoenixRelation). If not, I suggest
>> following up with Hortonworks.
>>
>> Josh
>>
>>
>>
>> On Fri, Apr 8, 2016 at 1:22 AM, Divya Gehlot <divya.htco...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> I have a Hortonworks Hadoop cluster with the configurations below:
>>> Spark 1.5.2
>>> HBASE 1.1.x
>>> Phoenix 4.4
>>>
>>> I am able to connect to Phoenix through JDBC connection and able to read
>>> the Phoenix tables .
>>> But while writing the data back to Phoenix table
>>> I am getting below error :
>>>
>>> org.apache.spark.sql.AnalysisException:
>>> org.apache.phoenix.spark.DefaultSource does not allow user-specified
>>> schemas.;
>>>
>>> Can anybody help resolve the above error, or suggest another way of
>>> saving Spark DataFrames to Phoenix?
>>>
>>> Would really appreciate the help.
>>>
>>> Thanks,
>>> Divya
>>>
>>
>>
>


Re: [HELP:]Save Spark Dataframe in Phoenix Table

2016-04-08 Thread Josh Mahonin
Hi Divya,

That's strange. Are you able to post a snippet of your code to look at? And
are you sure that you're saving the dataframes as per the docs (
https://phoenix.apache.org/phoenix_spark.html)?

Depending on your HDP version, it may or may not actually have
phoenix-spark support. Double-check that your Spark configuration is set up
with the right worker/driver classpath settings, and that the phoenix JARs
contain the necessary phoenix-spark classes
(e.g. org.apache.phoenix.spark.PhoenixRelation). If not, I suggest
following up with Hortonworks.

Josh
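
As a sketch of what "saving as per the docs" looks like (the table name and
zkUrl below are placeholders, and the call is .save(), not .saveAsTable()):

import org.apache.spark.sql.SaveMode

// writes into an existing Phoenix table; Overwrite is the only accepted mode
df.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)
  .option("table", "OUTPUT_TABLE")
  .option("zkUrl", "zk-host:2181")
  .save()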



On Fri, Apr 8, 2016 at 1:22 AM, Divya Gehlot 
wrote:

> Hi,
> I have a Hortonworks Hadoop cluster with the configurations below:
> Spark 1.5.2
> HBASE 1.1.x
> Phoenix 4.4
>
> I am able to connect to Phoenix through JDBC connection and able to read
> the Phoenix tables .
> But while writing the data back to Phoenix table
> I am getting below error :
>
> org.apache.spark.sql.AnalysisException:
> org.apache.phoenix.spark.DefaultSource does not allow user-specified
> schemas.;
>
> Can anybody help resolve the above error, or suggest another way of
> saving Spark DataFrames to Phoenix?
>
> Would really appreciate the help.
>
> Thanks,
> Divya
>


Re: Phoenix DB Migration with Flyway

2016-03-08 Thread Josh Mahonin
Tweeted!
https://twitter.com/jmahonin/status/707229582714220545

On Tue, Mar 8, 2016 at 10:19 AM, James Taylor <jamestay...@apache.org>
wrote:

> Awesome work, Josh. Thanks for letting us know - how about a tweet with an
> @ mention of flywaydb and ApachePhoenix to help spread the word further?
>
> James
>
>
> On Tuesday, March 8, 2016, Josh Mahonin <jmaho...@gmail.com> wrote:
>
>> Hi all,
>>
>> Just thought I'd let you know that Flyway 4.0 was recently released,
>> which includes support for DB migrations with Phoenix.
>>
>> https://flywaydb.org/blog/flyway-4.0
>>
>> Josh
>>
>


Phoenix DB Migration with Flyway

2016-03-08 Thread Josh Mahonin
Hi all,

Just thought I'd let you know that Flyway 4.0 was recently released, which
includes support for DB migrations with Phoenix.

https://flywaydb.org/blog/flyway-4.0

Josh
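
A minimal sketch of pointing Flyway 4.0 at Phoenix (the JDBC URL and the
migration location below are placeholders; the Phoenix client JAR needs to
be on the classpath so the driver can be found):

import org.flywaydb.core.Flyway

object MigratePhoenix {
  def main(args: Array[String]): Unit = {
    val flyway = new Flyway()
    // jdbc:phoenix URL points at the ZooKeeper quorum; no user/password on an unsecured cluster
    flyway.setDataSource("jdbc:phoenix:zk1,zk2,zk3:2181", null, null)
    flyway.setLocations("classpath:db/migration") // where the V1__*.sql scripts live
    flyway.migrate()
  }
}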


Re: Spark Phoenix Plugin

2016-02-20 Thread Josh Mahonin
Hi Ben,

Can you describe in more detail what your environment is? Are you using
stock installs of HBase, Spark and Phoenix? Are you using the hadoop2.4
pre-built Spark distribution as per the documentation [1]?

The unread block data error is commonly traced back to this issue [2] which
indicates some sort of mismatched version problem..

Thanks,

Josh

[1] https://phoenix.apache.org/phoenix_spark.html
[2] https://issues.apache.org/jira/browse/SPARK-1867
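
If it helps narrow things down, the equivalent DataFrameReader form of the
load in the quoted snippet below (table and zkUrl are the same placeholders) is:

val df = sqlContext.read
  .format("org.apache.phoenix.spark")
  .option("table", "TEST.MY_TEST")
  .option("zkUrl", "zk1,zk2,zk3:2181")
  .load()
df.select("ID").show()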

On Fri, Feb 19, 2016 at 2:18 PM, Benjamin Kim <bbuil...@gmail.com> wrote:

> Hi Josh,
>
> When I run the following code in spark-shell for spark 1.6:
>
> import org.apache.phoenix.spark._
> val df = sqlContext.load("org.apache.phoenix.spark", Map("table" ->
> "TEST.MY_TEST", "zkUrl" -> “zk1,zk2,zk3:2181"))
> df.select(df("ID")).show()
>
> I get this error:
>
> java.lang.IllegalStateException: unread block data
>
> Thanks,
> Ben
>
>
> On Feb 19, 2016, at 11:12 AM, Josh Mahonin <jmaho...@gmail.com> wrote:
>
> What specifically doesn't work for you?
>
> I have a Docker image that I used to do some basic testing on it with and
> haven't run into any problems:
> https://github.com/jmahonin/docker-phoenix/tree/phoenix_spark
>
> On Fri, Feb 19, 2016 at 12:40 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> All,
>>
>> Thanks for the help. I have switched out Cloudera’s HBase 1.0.0 with the
>> current Apache HBase 1.1.3. Also, I installed Phoenix 4.7.0, and everything
>> works fine except for the Phoenix Spark Plugin. I wonder if it’s a version
>> incompatibility issue with Spark 1.6. Has anyone tried compiling 4.7.0
>> using Spark 1.6?
>>
>> Thanks,
>> Ben
>>
>> On Feb 12, 2016, at 6:33 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>> Anyone know when Phoenix 4.7 will be officially released? And what
>> Cloudera distribution versions will it be compatible with?
>>
>> Thanks,
>> Ben
>>
>> On Feb 10, 2016, at 11:03 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>> Hi Pierre,
>>
>> I am getting this error now.
>>
>> Error: org.apache.phoenix.exception.PhoenixIOException:
>> org.apache.hadoop.hbase.DoNotRetryIOException:
>> SYSTEM.CATALOG,,1453397732623.8af7b44f3d7609eb301ad98641ff2611.:
>> org.apache.hadoop.hbase.client.Delete.setAttribute(Ljava/lang/String;[B)Lorg/apache/hadoop/hbase/client/Delete;
>>
>> I even tried to use sqlline.py to do some queries too. It resulted in the
>> same error. I followed the installation instructions. Is there something
>> missing?
>>
>> Thanks,
>> Ben
>>
>>
>> On Feb 9, 2016, at 10:20 AM, Ravi Kiran <maghamraviki...@gmail.com>
>> wrote:
>>
>> Hi Pierre,
>>
>>   Try your luck for building the artifacts from
>> https://github.com/chiastic-security/phoenix-for-cloudera. Hopefully it
>> helps.
>>
>> Regards
>> Ravi .
>>
>> On Tue, Feb 9, 2016 at 10:04 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>>> Hi Pierre,
>>>
>>> I found this article about how Cloudera’s version of HBase is very
>>> different than Apache HBase so it must be compiled using Cloudera’s repo
>>> and versions. But, I’m not having any success with it.
>>>
>>>
>>> http://stackoverflow.com/questions/31849454/using-phoenix-with-cloudera-hbase-installed-from-repo
>>>
>>> There’s also a Chinese site that does the same thing.
>>>
>>> https://www.zybuluo.com/xtccc/note/205739
>>>
>>> I keep getting errors like the ones below.
>>>
>>> [ERROR]
>>> /opt/tools/phoenix/phoenix-core/src/main/java/org/apache/hadoop/hbase/regionserver/LocalIndexMerger.java:[110,29]
>>> cannot find symbol
>>> [ERROR] symbol:   class Region
>>> [ERROR] location: class
>>> org.apache.hadoop.hbase.regionserver.LocalIndexMerger
>>> …
>>>
>>> Have you tried this also?
>>>
>>> As a last resort, we will have to abandon Cloudera’s HBase for Apache’s
>>> HBase.
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> On Feb 8, 2016, at 11:04 PM, pierre lacave <pie...@lacave.me> wrote:
>>>
>>> Haven't met that one.
>>>
>>> According to SPARK-1867, the real issue is hidden.
>>>
>>> I'd go by process of elimination, maybe try in local[*] mode first
>>>
>>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-1867
>>>
>>> On Tue, 9 Feb 2

Re: Spark Phoenix Plugin

2016-02-19 Thread Josh Mahonin
What specifically doesn't work for you?

I have a Docker image that I used to do some basic testing on it with and
haven't run into any problems:
https://github.com/jmahonin/docker-phoenix/tree/phoenix_spark

On Fri, Feb 19, 2016 at 12:40 PM, Benjamin Kim <bbuil...@gmail.com> wrote:

> All,
>
> Thanks for the help. I have switched out Cloudera’s HBase 1.0.0 with the
> current Apache HBase 1.1.3. Also, I installed Phoenix 4.7.0, and everything
> works fine except for the Phoenix Spark Plugin. I wonder if it’s a version
> incompatibility issue with Spark 1.6. Has anyone tried compiling 4.7.0
> using Spark 1.6?
>
> Thanks,
> Ben
>
> On Feb 12, 2016, at 6:33 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
> Anyone know when Phoenix 4.7 will be officially released? And what
> Cloudera distribution versions will it be compatible with?
>
> Thanks,
> Ben
>
> On Feb 10, 2016, at 11:03 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
> Hi Pierre,
>
> I am getting this error now.
>
> Error: org.apache.phoenix.exception.PhoenixIOException:
> org.apache.hadoop.hbase.DoNotRetryIOException:
> SYSTEM.CATALOG,,1453397732623.8af7b44f3d7609eb301ad98641ff2611.:
> org.apache.hadoop.hbase.client.Delete.setAttribute(Ljava/lang/String;[B)Lorg/apache/hadoop/hbase/client/Delete;
>
> I even tried to use sqlline.py to do some queries too. It resulted in the
> same error. I followed the installation instructions. Is there something
> missing?
>
> Thanks,
> Ben
>
>
> On Feb 9, 2016, at 10:20 AM, Ravi Kiran <maghamraviki...@gmail.com> wrote:
>
> Hi Pierre,
>
>   Try your luck for building the artifacts from
> https://github.com/chiastic-security/phoenix-for-cloudera. Hopefully it
> helps.
>
> Regards
> Ravi .
>
> On Tue, Feb 9, 2016 at 10:04 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> Hi Pierre,
>>
>> I found this article about how Cloudera’s version of HBase is very
>> different than Apache HBase so it must be compiled using Cloudera’s repo
>> and versions. But, I’m not having any success with it.
>>
>>
>> http://stackoverflow.com/questions/31849454/using-phoenix-with-cloudera-hbase-installed-from-repo
>>
>> There’s also a Chinese site that does the same thing.
>>
>> https://www.zybuluo.com/xtccc/note/205739
>>
>> I keep getting errors like the ones below.
>>
>> [ERROR]
>> /opt/tools/phoenix/phoenix-core/src/main/java/org/apache/hadoop/hbase/regionserver/LocalIndexMerger.java:[110,29]
>> cannot find symbol
>> [ERROR] symbol:   class Region
>> [ERROR] location: class
>> org.apache.hadoop.hbase.regionserver.LocalIndexMerger
>> …
>>
>> Have you tried this also?
>>
>> As a last resort, we will have to abandon Cloudera’s HBase for Apache’s
>> HBase.
>>
>> Thanks,
>> Ben
>>
>>
>> On Feb 8, 2016, at 11:04 PM, pierre lacave <pie...@lacave.me> wrote:
>>
>> Haven't met that one.
>>
>> According to SPARK-1867, the real issue is hidden.
>>
>> I'd go by process of elimination, maybe try in local[*] mode first
>>
>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-1867
>>
>> On Tue, 9 Feb 2016, 04:58 Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>>> Pierre,
>>>
>>> I got it to work using phoenix-4.7.0-HBase-1.0-client-spark.jar. But,
>>> now, I get this error:
>>>
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>>> 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
>>> 0.0 (TID 3, prod-dc1-datanode151.pdc1i.gradientx.com):
>>> java.lang.IllegalStateException: unread block data
>>>
>>> It happens when I do:
>>>
>>> df.show()
>>>
>>> Getting closer…
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>>
>>> On Feb 8, 2016, at 2:57 PM, pierre lacave <pie...@lacave.me> wrote:
>>>
>>> This is the wrong client jar try with the one named
>>> phoenix-4.7.0-HBase-1.1-client-spark.jar
>>>
>>> On Mon, 8 Feb 2016, 22:29 Benjamin Kim <bbuil...@gmail.com> wrote:
>>>
>>>> Hi Josh,
>>>>
>>>> I tried again by putting the settings within the spark-default.conf.
>>>>
>>>>
>>>> spark.driver.extraClassPath=/opt/tools/phoenix/phoenix-4.7.0-HBase-1.0-client.jar
>>>>
>>>> spark.executor.extraClassPath=/opt/tools/phoenix/phoenix-4.7.0-HBase-1.0-client.jar
>>>>
>>>> I still get the same error using

Re: Save dataframe to Phoenix

2016-02-17 Thread Josh Mahonin
Hi Krishna,

There was some talk a few weeks ago about a new feature to allow creating /
saving to tables dynamically, with schema inferred from the DataFrame.
However, I don't believe a JIRA has been filed for it yet.

As always, pull requests are appreciated.

Josh
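
In the meantime, the RDD-based save is another option when the
overwrite-only DataFrame path doesn't fit; a minimal sketch (table name,
column list and zkUrl are placeholders) looks like:

import org.apache.phoenix.spark._

// column names must match the Phoenix table; rows go in as Phoenix UPSERTs
sc.parallelize(Seq((1L, "foo"), (2L, "bar")))
  .saveToPhoenix("OUTPUT_TABLE", Seq("ID", "COL1"), zkUrl = Some("zk-host:2181"))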

On Tue, Feb 16, 2016 at 6:16 PM, Krishna  wrote:

> According to the Phoenix-Spark plugin docs, only SaveMode.Overwrite is supported
> for saving dataframes to a Phoenix table.
>
> Are there any plans to support other save modes (append, ignore) anytime
> soon? Only having overwrite option makes it useful for a small number of
> use-cases.
>


Re: Spark Phoenix Plugin

2016-02-08 Thread Josh Mahonin
Hi Ben,

I'm not sure about the format of those command line options you're passing.
I've had success with spark-shell just by setting the
'spark.executor.extraClassPath' and 'spark.driver.extraClassPath' options
on the spark config, as per the docs [1].

I'm not sure if there's anything special needed for CDH or not though. I
also have a docker image I've been toying with which has a working
Spark/Phoenix setup using the Phoenix 4.7.0 RC and Spark 1.6.0. It might be
a useful reference for you as well [2].

Good luck,

Josh

[1] https://phoenix.apache.org/phoenix_spark.html
[2] https://github.com/jmahonin/docker-phoenix/tree/phoenix_spark
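
If you'd rather keep everything on the command line, the --conf form should
be equivalent to the spark-defaults.conf entries (the JAR path below is only
an example):

spark-shell --master yarn-client \
  --conf spark.driver.extraClassPath=/opt/tools/phoenix/phoenix-4.7.0-HBase-1.0-client-spark.jar \
  --conf spark.executor.extraClassPath=/opt/tools/phoenix/phoenix-4.7.0-HBase-1.0-client-spark.jar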

On Mon, Feb 8, 2016 at 4:29 PM, Benjamin Kim  wrote:

> Hi Pierre,
>
> I tried to run in spark-shell using spark 1.6.0 by running this:
>
> spark-shell --master yarn-client --driver-class-path
> /opt/tools/phoenix/phoenix-4.7.0-HBase-1.0-client.jar --driver-java-options
> "-Dspark.executor.extraClassPath=/opt/tools/phoenix/phoenix-4.7.0-HBase-1.0-client.jar”
>
> The version of HBase is the one in CDH5.4.8, which is 1.0.0-cdh5.4.8.
>
> When I get to the line:
>
> val df = sqlContext.load("org.apache.phoenix.spark", Map("table" ->
> “TEST.MY_TEST", "zkUrl" -> “zk1,zk2,zk3:2181”))
>
> I get this error:
>
> java.lang.NoClassDefFoundError: Could not initialize class
> org.apache.spark.rdd.RDDOperationScope$
>
> Any ideas?
>
> Thanks,
> Ben
>
>
> On Feb 5, 2016, at 1:36 PM, pierre lacave  wrote:
>
> I don't know when the full release will be, RC1 just got pulled out, and
> expecting RC2 soon
>
> you can find them here
>
> https://dist.apache.org/repos/dist/dev/phoenix/
>
>
> there is a new phoenix-4.7.0-HBase-1.1-client-spark.jar that is all you
> need to have in spark classpath
>
>
> *Pierre Lacave*
> 171 Skellig House, Custom House, Lower Mayor street, Dublin 1, Ireland
> Phone :   +353879128708
>
> On Fri, Feb 5, 2016 at 9:28 PM, Benjamin Kim  wrote:
>
>> Hi Pierre,
>>
>> When will I be able to download this version?
>>
>> Thanks,
>> Ben
>>
>>
>> On Friday, February 5, 2016, pierre lacave  wrote:
>>
>>> This was addressed in Phoenix 4.7 (currently in RC)
>>> https://issues.apache.org/jira/browse/PHOENIX-2503
>>>
>>>
>>>
>>>
>>> *Pierre Lacave*
>>> 171 Skellig House, Custom House, Lower Mayor street, Dublin 1, Ireland
>>> Phone :   +353879128708
>>>
>>> On Fri, Feb 5, 2016 at 6:17 PM, Benjamin Kim  wrote:
>>>
 I cannot get this plugin to work in CDH 5.4.8 using Phoenix 4.5.2 and
 Spark 1.6. When I try to launch spark-shell, I get:

 java.lang.RuntimeException: java.lang.RuntimeException: Unable
 to instantiate 
 org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

 I continue on and run the example code. When I get tot the line below:

 val df = sqlContext.load("org.apache.phoenix.spark",
 Map("table" -> "TEST.MY_TEST", "zkUrl" ->
 "zookeeper1,zookeeper2,zookeeper3:2181")

 I get this error:

 java.lang.NoSuchMethodError:
 com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;

 Can someone help?

 Thanks,
 Ben
>>>
>>>
>>>
>
>


Re: Phoenix and Tableau

2016-01-28 Thread Josh Mahonin
Hey Thomas,

That's pretty neat if I read that right. You're able to use Tableau with
Phoenix using the Phoenix-Spark integration?

Thanks!

Josh

On Thu, Jan 28, 2016 at 2:31 PM, Thomas Decaux  wrote:

> Yeah, me too :/ I tried Spark; it works fine with Tableau on Mac.
>
> You should give a try!
> On 28 Jan 2016 8:30 PM, "Aaron Bossert"  wrote:
>
>> Nice!  It's a start... unfortunately, I use the OS X version.
>>
>> --
>> Aaron
>>
>> On Jan 28, 2016, at 2:26 PM, Thomas Decaux  wrote:
>>
>> They said it's only for Windows OS
>> On 28 Jan 2016 6:36 PM, "Aaron Bossert"  wrote:
>>
>>> Sorry for butting in, but do you mean that tableau supports JDBC
>>> drivers?  I have wanted to connect Phoenix to tableau for some time now as
>>> well, but have not seen any documentation from tableau to suggest that they
>>> now support JDBC drivers.  Just references to using a JDBC-ODBC bridge
>>> driver, and all the discussions I have seen about that have had very
>>> negative outcomes.
>>>
>>> --
>>> Aaron
>>>
>>> On Jan 28, 2016, at 12:14 PM, Thomas Decaux  wrote:
>>>
>>> You can use the JDBC driver already; also, you could use Spark as a proxy
>>> in between.
>>> On 28 Jan 2016 5:47 PM, "Riesland, Zack"  wrote:
>>>
 Hey folks,



 Everything I’ve read online about connecting Phoenix and Tableau is at
 least a year old.



 Has there been any progress on an ODBC driver?



 Any simple hacks to accomplish this?



 Thanks!



>>>


Re: phoenix-spark and pyspark

2016-01-24 Thread Josh Mahonin
Hey Nick,

Hopefully PHOENIX-2599 works out for you, but it's entirely possible
there's some other issues that crop up. My usage patterns are probably a
bit unique, in that I generally do a few JDBC UPSERT SELECT aggregations
from my "hot" tables to intermediate ones before Spark ever involved. That
likely hides an entire class of bugs, like the kind PHOENIX-2599 fixes.

However, as you're likely aware, the Spark integration is a pretty thin
wrapper over the regular MR integration, including Pig, so hopefully
there's not too much in the way of unexplored territory there in terms of
bulk loading and saving.

Good luck!

Josh
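
For what it's worth, those pre-Spark rollups are nothing fancy; a
hypothetical sketch over made-up HOT_EVENTS / DAILY_STATS tables, done over
plain JDBC, looks like:

import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")
try {
  val stmt = conn.createStatement()
  // roll the "hot" table up into an intermediate table before Spark reads it
  stmt.executeUpdate(
    "UPSERT INTO DAILY_STATS (METRIC, STAT_DAY, TOTAL) " +
    "SELECT METRIC, TRUNC(EVENT_TIME, 'DAY'), SUM(VAL) " +
    "FROM HOT_EVENTS GROUP BY METRIC, TRUNC(EVENT_TIME, 'DAY')")
  conn.commit() // Phoenix connections do not auto-commit by default
} finally {
  conn.close()
}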

On Sat, Jan 23, 2016 at 9:12 PM, Nick Dimiduk <ndimi...@apache.org> wrote:

> That looks about right. I was unaware of this patch; thanks!
>
> On Fri, Jan 22, 2016 at 5:10 PM, James Taylor <jamestay...@apache.org>
> wrote:
>
>> FYI, Nick - do you know about Josh's fix for PHOENIX-2599? Does that help
>> here?
>>
>> On Fri, Jan 22, 2016 at 4:32 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>
>>> On Thu, Jan 21, 2016 at 7:36 AM, Josh Mahonin <jmaho...@gmail.com>
>>> wrote:
>>>
>>>> Amazing, good work.
>>>>
>>>
>>> All I did was consume your code and configure my cluster. Thanks though
>>> :)
>>>
>>> FWIW, I've got a support case in with Hortonworks to get the
>>>> phoenix-spark integration working out of the box. Assuming it gets
>>>> resolved, that'll hopefully help keep these classpath-hell issues to a
>>>> minimum going forward.
>>>>
>>>
>>> That's fine for HWX customers, but it doesn't solve the general problem
>>> community-side. Unless, of course, we assume the only Phoenix users are on
>>> HDP.
>>>
>>> Interesting point re: PHOENIX-2535. Spark does offer builds for specific
>>>> Hadoop versions, and also no Hadoop at all (with the assumption you'll
>>>> provide the necessary JARs). Phoenix is pretty tightly coupled with its own
>>>> HBase (and by extension, Hadoop) versions though... do you think it be
>>>> possible to work around (2) if you locally added the HDP Maven repo and
>>>> adjusted versions accordingly? I've had some success with that in other
>>>> projects, though as I recall when I tried it with Phoenix I ran into a snag
>>>> trying to resolve some private transitive dependency of Hadoop.
>>>>
>>>
>>> I could specify Hadoop version and build Phoenix locally to remove the
>>> issue. It would work for me because I happen to be packaging my own Phoenix
>>> this week, but it doesn't help for the Apache releases. Phoenix _is_
>>> tightly coupled to HBase versions, but I don't think HBase versions are
>>> that tightly coupled to Hadoop versions. We build HBase 1.x releases
>>> against a long list of Hadoop releases as part of our usual build. I think
>>> what HBase consumes re: Hadoop API's is pretty limited.
>>>
>>> Now that you have Spark working you can start hitting real bugs! If you
>>>> haven't backported the full patchset, you might want to take a look at the
>>>> phoenix-spark history [1], there's been a lot of churn there, especially
>>>> with regards to the DataFrame API.
>>>>
>>>
>>> Yeah, speaking of which... :)
>>>
>>> I find this integration is basically unusable when the underlying HBase
>>> table partitions are in flux; i.e., when you have data being loaded
>>> concurrent to query. Spark RDD's assume stable partitions, multiple queries
>>> against a single DataFrame, or even a single query/DataFrame with lots of
>>> underlying splits and not enough workers, will inevitably fail
>>> with ConcurrentModificationException like [0].
>>>
>>> I think in HBase/MR integration, we define split points for the job as
>>> region boundaries initially, but then we're just using the regular API, so
>>> repartitions are handled transiently by the client layer. I need to dig
>>> into this Phoenix/Spark stuff to see if we can do anything similar here.
>>>
>>> Thanks again Josh,
>>> -n
>>>
>>> [0]:
>>>
>>> java.lang.RuntimeException: java.util.ConcurrentModificationException
>>> at
>>> org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(PhoenixInputFormat.java:125)
>>> at
>>> org.apache.phoenix.mapreduce.PhoenixInputFormat.createRecordReader(PhoenixInputFormat.java:69)
>>> at
>>> org.apache.spark.rdd.

Re: phoenix-spark and pyspark

2016-01-21 Thread Josh Mahonin
>> Caused by: java.lang.NumberFormatException: For input string: "e07"
>> at
>> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>> at java.lang.Long.parseLong(Long.java:589)
>> at java.lang.Long.parseLong(Long.java:631)
>> at
>> org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137)
>> at
>> org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177)
>> ... 12 more
>>
>> [1]:
>> http://mail-archives.us.apache.org/mod_mbox/spark-user/201503.mbox/%3CCAAqMD1jSEvfyw9oUBymhZukm7f+WTDVZ8E6Zp3L4a9OBJ-hz=a...@mail.gmail.com%3E
>>
>> On Wed, Jan 20, 2016 at 1:29 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
>>
>>> That's great to hear. Looking forward to the doc patch!
>>>
>>> On Wed, Jan 20, 2016 at 3:43 PM, Nick Dimiduk <ndimi...@apache.org>
>>> wrote:
>>>
>>>> Josh -- I deployed my updated phoenix build across the cluster, added
>>>> the phoenix-client-spark.jar to configs on the whole cluster, and now basic
>>>> dataframe access is now working. Let me see about updating the docs page to
>>>> be more clear, I'll send a patch by you for review.
>>>>
>>>> Thanks a lot for the help!
>>>> -n
>>>>
>>>> On Tue, Jan 19, 2016 at 5:59 PM, Josh Mahonin <jmaho...@gmail.com>
>>>> wrote:
>>>>
>>>>> Right, this cluster I just tested on is HDP 2.3.4, so it's Spark on
>>>>> YARN as well. I suppose the JAR is probably shipped by YARN, though I 
>>>>> don't
>>>>> see any logging saying it, so I'm not certain how the nuts and bolts of
>>>>> that work. By explicitly setting the classpath, we're bypassing Spark's
>>>>> native JAR broadcast though.
>>>>>
>>>>> Taking a quick look at the config in Ambari (which ships the config to
>>>>> each node after saving), in 'Custom spark-defaults' I have the following:
>>>>>
>>>>> spark.driver.extraClassPath ->
>>>>> /etc/hbase/conf:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>>>> spark.executor.extraClassPath ->
>>>>> /usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>>>>
>>>>> I'm not sure if the /etc/hbase/conf is necessarily needed, but I think
>>>>> that gets the Ambari generated hbase-site.xml in the classpath. Each node
>>>>> has the custom phoenix-client-spark.jar installed to that same path as 
>>>>> well.
>>>>>
>>>>> I can pop into regular spark-shell and load RDDs/DataFrames using:
>>>>> /usr/hdp/current/spark-client/bin/spark-shell --master yarn-client
>>>>>
>>>>> or pyspark via:
>>>>> /usr/hdp/current/spark-client/bin/pyspark
>>>>>
>>>>> I also do this as the Ambari-created 'spark' user, I think there was
>>>>> some fun HDFS permission issue otherwise.
>>>>>
>>>>> On Tue, Jan 19, 2016 at 8:23 PM, Nick Dimiduk <ndimi...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> I'm using Spark on YARN, not spark stand-alone. YARN NodeManagers are
>>>>>> colocated with RegionServers; all the hosts have everything. There are no
>>>>>> spark workers to restart. You're sure it's not shipped by the YARN 
>>>>>> runtime?
>>>>>>
>>>>>> On Tue, Jan 19, 2016 at 5:07 PM, Josh Mahonin <jmaho...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Sadly, it needs to be installed onto each Spark worker (for now).
>>>>>>> The executor config tells each Spark worker to look for that file to 
>>>>>>> add to
>>>>>>> its classpath, so once you have it installed, you'll probably need to
>>>>>>> restart all the Spark workers.
>>>>>>>
>>>>>>> I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
>>>>>>> /usr/hdp/current/phoenix-client/, but anywhere that each worker can
>>>>>>> consistently see is fine.
>>>>>>>
>>>>>>> One day we'll be able to have Spark ship the JAR around and use it
>>>>>>> without this classpath nonsense, but we need to do some extra work on 
>>>>>>> the
>>>>>>> Phoenix side to make sure that Phoe

Re: phoenix-spark and pyspark

2016-01-20 Thread Josh Mahonin
That's great to hear. Looking forward to the doc patch!

On Wed, Jan 20, 2016 at 3:43 PM, Nick Dimiduk <ndimi...@apache.org> wrote:

> Josh -- I deployed my updated phoenix build across the cluster, added the
> phoenix-client-spark.jar to configs on the whole cluster, and now basic
> dataframe access is now working. Let me see about updating the docs page to
> be more clear, I'll send a patch by you for review.
>
> Thanks a lot for the help!
> -n
>
> On Tue, Jan 19, 2016 at 5:59 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
>
>> Right, this cluster I just tested on is HDP 2.3.4, so it's Spark on YARN
>> as well. I suppose the JAR is probably shipped by YARN, though I don't see
>> any logging saying it, so I'm not certain how the nuts and bolts of that
>> work. By explicitly setting the classpath, we're bypassing Spark's native
>> JAR broadcast though.
>>
>> Taking a quick look at the config in Ambari (which ships the config to
>> each node after saving), in 'Custom spark-defaults' I have the following:
>>
>> spark.driver.extraClassPath ->
>> /etc/hbase/conf:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>> spark.executor.extraClassPath ->
>> /usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>
>> I'm not sure if the /etc/hbase/conf is necessarily needed, but I think
>> that gets the Ambari generated hbase-site.xml in the classpath. Each node
>> has the custom phoenix-client-spark.jar installed to that same path as well.
>>
>> I can pop into regular spark-shell and load RDDs/DataFrames using:
>> /usr/hdp/current/spark-client/bin/spark-shell --master yarn-client
>>
>> or pyspark via:
>> /usr/hdp/current/spark-client/bin/pyspark
>>
>> I also do this as the Ambari-created 'spark' user, I think there was some
>> fun HDFS permission issue otherwise.
>>
>> On Tue, Jan 19, 2016 at 8:23 PM, Nick Dimiduk <ndimi...@apache.org>
>> wrote:
>>
>>> I'm using Spark on YARN, not spark stand-alone. YARN NodeManagers are
>>> colocated with RegionServers; all the hosts have everything. There are no
>>> spark workers to restart. You're sure it's not shipped by the YARN runtime?
>>>
>>> On Tue, Jan 19, 2016 at 5:07 PM, Josh Mahonin <jmaho...@gmail.com>
>>> wrote:
>>>
>>>> Sadly, it needs to be installed onto each Spark worker (for now). The
>>>> executor config tells each Spark worker to look for that file to add to its
>>>> classpath, so once you have it installed, you'll probably need to restart
>>>> all the Spark workers.
>>>>
>>>> I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
>>>> /usr/hdp/current/phoenix-client/, but anywhere that each worker can
>>>> consistently see is fine.
>>>>
>>>> One day we'll be able to have Spark ship the JAR around and use it
>>>> without this classpath nonsense, but we need to do some extra work on the
>>>> Phoenix side to make sure that Phoenix's calls to DriverManager actually go
>>>> through Spark's weird wrapper version of it.
>>>>
>>>> On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <ndimi...@apache.org>
>>>> wrote:
>>>>
>>>>> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <jmaho...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> What version of Spark are you using?
>>>>>>
>>>>>
>>>>> Probably HDP's Spark 1.4.1; that's what the jars in my install say,
>>>>> and the welcome message in the pyspark console agrees.
>>>>>
>>>>> Are there any other traces of exceptions anywhere?
>>>>>>
>>>>>
>>>>> No other exceptions that I can find. YARN apparently doesn't want to
>>>>> aggregate spark's logs.
>>>>>
>>>>>
>>>>>> Are all your Spark nodes set up to point to the same
>>>>>> phoenix-client-spark JAR?
>>>>>>
>>>>>
>>>>> Yes, as far as I can tell... though come to think of it, is that jar
>>>>> shipped by the spark runtime to workers, or is it loaded locally on each
>>>>> host? I only changed spark-defaults.conf on the client machine, the 
>>>>> machine
>>>>> from which I submitted the job.
>>>>>
>>>>> Thanks for taking a look Josh!
>>>>>
>>>>> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <ndimi...@apache.org>
>>>

Re: phoenix-spark and pyspark

2016-01-19 Thread Josh Mahonin
Sadly, it needs to be installed onto each Spark worker (for now). The
executor config tells each Spark worker to look for that file to add to its
classpath, so once you have it installed, you'll probably need to restart
all the Spark workers.

I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
/usr/hdp/current/phoenix-client/, but anywhere that each worker can
consistently see is fine.

One day we'll be able to have Spark ship the JAR around and use it without
this classpath nonsense, but we need to do some extra work on the Phoenix
side to make sure that Phoenix's calls to DriverManager actually go through
Spark's weird wrapper version of it.

On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <ndimi...@apache.org> wrote:

> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
>
>> What version of Spark are you using?
>>
>
> Probably HDP's Spark 1.4.1; that's what the jars in my install say, and
> the welcome message in the pyspark console agrees.
>
> Are there any other traces of exceptions anywhere?
>>
>
> No other exceptions that I can find. YARN apparently doesn't want to
> aggregate spark's logs.
>
>
>> Are all your Spark nodes set up to point to the same phoenix-client-spark
>> JAR?
>>
>
> Yes, as far as I can tell... though come to think of it, is that jar
> shipped by the spark runtime to workers, or is it loaded locally on each
> host? I only changed spark-defaults.conf on the client machine, the machine
> from which I submitted the job.
>
> Thanks for taking a look Josh!
>
> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <ndimi...@apache.org> wrote:
>>
>>> Hi guys,
>>>
>>> I'm doing my best to follow along with [0], but I'm hitting some
>>> stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My phoenix
>>> build is much newer, basically 4.6-branch + PHOENIX-2503, PHOENIX-2568. I'm
>>> using pyspark for now.
>>>
>>> I've added phoenix-$VERSION-client-spark.jar to both
>>> spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
>>> me to use sqlContext.read to define a DataFrame against a Phoenix table.
>>> This appears to basically work, as I see PhoenixInputFormat in the logs and
>>> df.printSchema() shows me what I expect. However, when I try df.take(5), I
>>> get "IllegalStateException: unread block data" [1] from the workers. Poking
>>> around, this is commonly a problem with classpath. Any ideas as to
>>> specifically which jars are needed? Or better still, how to debug this
>>> issue myself. Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath
>>> gives me a VerifyError about netty method version mismatch. Indeed I see
>>> two netty versions in that lib directory...
>>>
>>> Thanks a lot,
>>> -n
>>>
>>> [0]: http://phoenix.apache.org/phoenix_spark.html
>>> [1]:
>>>
>>> java.lang.IllegalStateException: unread block data
>>> at
>>> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>> at
>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>> at
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>> at
>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>> at
>>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>>
>>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <jamestay...@apache.org>
>>> wrote:
>>>
>>>> Thanks for remembering about the docs, Josh.
>>>>
>>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <jmaho...@gmail.com>
>>>> wrote:
>>>>
>>>>> Just an update for anyone interested, PHOENIX-2503 was just committed
>>>>> for 4.7.0 and the docs have been updated to

Re: phoenix-spark and pyspark

2016-01-19 Thread Josh Mahonin
Right, this cluster I just tested on is HDP 2.3.4, so it's Spark on YARN as
well. I suppose the JAR is probably shipped by YARN, though I don't see any
logging saying it, so I'm not certain how the nuts and bolts of that work.
By explicitly setting the classpath, we're bypassing Spark's native JAR
broadcast though.

Taking a quick look at the config in Ambari (which ships the config to each
node after saving), in 'Custom spark-defaults' I have the following:

spark.driver.extraClassPath ->
/etc/hbase/conf:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
spark.executor.extraClassPath ->
/usr/hdp/current/phoenix-client/phoenix-client-spark.jar

I'm not sure if the /etc/hbase/conf is necessarily needed, but I think that
gets the Ambari generated hbase-site.xml in the classpath. Each node has
the custom phoenix-client-spark.jar installed to that same path as well.

I can pop into regular spark-shell and load RDDs/DataFrames using:
/usr/hdp/current/spark-client/bin/spark-shell --master yarn-client

or pyspark via:
/usr/hdp/current/spark-client/bin/pyspark

I also do this as the Ambari-created 'spark' user, I think there was some
fun HDFS permission issue otherwise.
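
Once the classpath is in place, the pyspark side is just the generic
DataFrame reader (table and zkUrl below are placeholders):

df = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "TABLE1") \
    .option("zkUrl", "zk-host:2181") \
    .load()
df.printSchema()
df.show(5)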

On Tue, Jan 19, 2016 at 8:23 PM, Nick Dimiduk <ndimi...@apache.org> wrote:

> I'm using Spark on YARN, not spark stand-alone. YARN NodeManagers are
> colocated with RegionServers; all the hosts have everything. There are no
> spark workers to restart. You're sure it's not shipped by the YARN runtime?
>
> On Tue, Jan 19, 2016 at 5:07 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
>
>> Sadly, it needs to be installed onto each Spark worker (for now). The
>> executor config tells each Spark worker to look for that file to add to its
>> classpath, so once you have it installed, you'll probably need to restart
>> all the Spark workers.
>>
>> I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
>> /usr/hdp/current/phoenix-client/, but anywhere that each worker can
>> consistently see is fine.
>>
>> One day we'll be able to have Spark ship the JAR around and use it
>> without this classpath nonsense, but we need to do some extra work on the
>> Phoenix side to make sure that Phoenix's calls to DriverManager actually go
>> through Spark's weird wrapper version of it.
>>
>> On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <ndimi...@apache.org>
>> wrote:
>>
>>> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <jmaho...@gmail.com>
>>> wrote:
>>>
>>>> What version of Spark are you using?
>>>>
>>>
>>> Probably HDP's Spark 1.4.1; that's what the jars in my install say, and
>>> the welcome message in the pyspark console agrees.
>>>
>>> Are there any other traces of exceptions anywhere?
>>>>
>>>
>>> No other exceptions that I can find. YARN apparently doesn't want to
>>> aggregate spark's logs.
>>>
>>>
>>>> Are all your Spark nodes set up to point to the same
>>>> phoenix-client-spark JAR?
>>>>
>>>
>>> Yes, as far as I can tell... though come to think of it, is that jar
>>> shipped by the spark runtime to workers, or is it loaded locally on each
>>> host? I only changed spark-defaults.conf on the client machine, the machine
>>> from which I submitted the job.
>>>
>>> Thanks for taking a look Josh!
>>>
>>> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <ndimi...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> I'm doing my best to follow along with [0], but I'm hitting some
>>>>> stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My phoenix
>>>>> build is much newer, basically 4.6-branch + PHOENIX-2503, PHOENIX-2568. 
>>>>> I'm
>>>>> using pyspark for now.
>>>>>
>>>>> I've added phoenix-$VERSION-client-spark.jar to both
>>>>> spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
>>>>> me to use sqlContext.read to define a DataFrame against a Phoenix table.
>>>>> This appears to basically work, as I see PhoenixInputFormat in the logs 
>>>>> and
>>>>> df.printSchema() shows me what I expect. However, when I try df.take(5), I
>>>>> get "IllegalStateException: unread block data" [1] from the workers. 
>>>>> Poking
>>>>> around, this is commonly a problem with classpath. Any ideas as to
>>>>> specifically which jars are needed? Or better still, how to debug this
>>>

Re: StaleRegionBoundaryCacheException with Phoenix 4.5.1-HBase 1.0 and Spark 1.4.1

2016-01-16 Thread Josh Mahonin
Hi Li,

This should be resolved in Phoenix 4.7.0. You can also apply the patch
yourself from here:
https://issues.apache.org/jira/browse/PHOENIX-2599

Please let us know if it fixes the problem for you.

Thanks!

Josh

On Thu, Jan 14, 2016 at 6:49 PM, James Taylor <jamestay...@apache.org>
wrote:

> That was my first thought too, Alicia. However, our MR and Spark
> integration uses a different code path. See my comment on PHOENIX-2599 for
> a potential fix.
>
> Thanks,
> James
>
> On Thu, Jan 14, 2016 at 2:55 PM, Alicia Shu <a...@hortonworks.com> wrote:
>
>> With the fix of PHOENIX-2447
>> <https://issues.apache.org/jira/browse/PHOENIX-2447> in 4.7.0, now
>> Phoenix will try multiple times for StaleRegionBoundaryCacheException until
>> query timeout.
>>
>> Alicia
>>
>> From: Josh Mahonin <jmaho...@gmail.com>
>> Reply-To: "user@phoenix.apache.org" <user@phoenix.apache.org>
>> Date: Thursday, January 14, 2016 at 12:33 PM
>> To: "user@phoenix.apache.org" <user@phoenix.apache.org>
>> Subject: Re: StaleRegionBoundaryCacheException with Phoenix 4.5.1-HBase
>> 1.0 and Spark 1.4.1
>>
>> Hi Li,
>>
>> I've not seen this error myself, though some searching returns a possible
>> root cause:
>>
>> http://mail-archives.apache.org/mod_mbox/incubator-phoenix-user/201507.mbox/%3CCAAF1JdjNW98dAnxf3kx=ndkswyorpt1redb9dqwjbhvvlfn...@mail.gmail.com%3E
>>
>> Could you file a JIRA ticket for this please? It's possible the MapReduce
>> (and by extension, Spark) integration isn't handling this gracefully.
>>
>> Also, if possible, any information you have about your environment, table
>> sizes, read/write workload, etc. would be helpful as well.
>>
>> Thanks!
>>
>> Josh
>>
>> On Thu, Jan 14, 2016 at 3:14 PM, Li Gao <g...@marinsoftware.com> wrote:
>>
>>> Hi Phoenix users,
>>>
>>>
>>> We are seeing occasionally (maybe 30 ~ 50% of time) this
>>> StaleRegionBoundaryCacheException when running Spark 1.4.1 and Phoenix
>>> 4.5.1-HBase 1.0.
>>>
>>> Not sure how to troubleshoot such issue. Any hints and insights are
>>> greatly appreciated!
>>>
>>> Thanks,
>>>
>>> Li
>>>
>>> PS: The following are the exception stack trace:
>>>
>>> 16/01/14 19:40:16 ERROR yarn.ApplicationMaster: User class threw
>>> exception: org.apache.spark.SparkException: Job aborted due to stage
>>> failure: Task 5 in stage 110.0 failed 4 times, most recent failure: Lost
>>> task 5.3 in stage 110.0 (TID 35526, datanode-123.somewhere):
>>> java.lang.RuntimeException:
>>> org.apache.phoenix.schema.StaleRegionBoundaryCacheException: ERROR 1108
>>> (XCL08): Cache of region boundaries are out of date.
>>>
>>> at com.google.common.base.Throwables.propagate(Throwables.java:156)
>>>
>>> at
>>> org.apache.phoenix.mapreduce.PhoenixRecordReader.initialize(PhoenixRecordReader.java:126)
>>>
>>> at
>>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:133)
>>>
>>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
>>>
>>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
>>>
>>> at org.apache.phoenix.spark.PhoenixRDD.compute(PhoenixRDD.scala:52)
>>>
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>>
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>>
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>>
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>>
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>>
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>>
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>>
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>>
>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>>>
>>> at org.apache.spark.scheduler.Task.run(Task.scala:70)
>>>
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>>
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor

Re: StaleRegionBoundaryCacheException with Phoenix 4.5.1-HBase 1.0 and Spark 1.4.1

2016-01-14 Thread Josh Mahonin
Hi Li,

I've not seen this error myself, though some searching returns a possible
root cause:
http://mail-archives.apache.org/mod_mbox/incubator-phoenix-user/201507.mbox/%3CCAAF1JdjNW98dAnxf3kx=ndkswyorpt1redb9dqwjbhvvlfn...@mail.gmail.com%3E

Could you file a JIRA ticket for this please? It's possible the MapReduce
(and by extension, Spark) integration isn't handling this gracefully.

Also, if possible, any information you have about your environment, table
sizes, read/write workload, etc. would be helpful as well.

Thanks!

Josh

On Thu, Jan 14, 2016 at 3:14 PM, Li Gao  wrote:

> Hi Phoenix users,
>
>
> We are seeing occasionally (maybe 30 ~ 50% of time) this
> StaleRegionBoundaryCacheException when running Spark 1.4.1 and Phoenix
> 4.5.1-HBase 1.0.
>
> Not sure how to troubleshoot such issue. Any hints and insights are
> greatly appreciated!
>
> Thanks,
>
> Li
>
> PS: The following are the exception stack trace:
>
> 16/01/14 19:40:16 ERROR yarn.ApplicationMaster: User class threw
> exception: org.apache.spark.SparkException: Job aborted due to stage
> failure: Task 5 in stage 110.0 failed 4 times, most recent failure: Lost
> task 5.3 in stage 110.0 (TID 35526, datanode-123.somewhere):
> java.lang.RuntimeException:
> org.apache.phoenix.schema.StaleRegionBoundaryCacheException: ERROR 1108
> (XCL08): Cache of region boundaries are out of date.
>
> at com.google.common.base.Throwables.propagate(Throwables.java:156)
>
> at
> org.apache.phoenix.mapreduce.PhoenixRecordReader.initialize(PhoenixRecordReader.java:126)
>
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:133)
>
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
>
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
>
> at org.apache.phoenix.spark.PhoenixRDD.compute(PhoenixRDD.scala:52)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
>
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>
> at java.lang.Thread.run(Thread.java:745)
>
> Caused by: org.apache.phoenix.schema.StaleRegionBoundaryCacheException:
> ERROR 1108 (XCL08): Cache of region boundaries are out of date.
>
> at
> org.apache.phoenix.exception.SQLExceptionCode$13.newException(SQLExceptionCode.java:304)
>
> at
> org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:145)
>
> at
> org.apache.phoenix.util.ServerUtil.parseRemoteException(ServerUtil.java:131)
>
> at
> org.apache.phoenix.util.ServerUtil.parseServerExceptionOrNull(ServerUtil.java:115)
>
> at
> org.apache.phoenix.util.ServerUtil.parseServerException(ServerUtil.java:104)
>
> at
> org.apache.phoenix.iterate.TableResultIterator.getDelegate(TableResultIterator.java:70)
>
> at
> org.apache.phoenix.iterate.TableResultIterator.(TableResultIterator.java:88)
>
> at
> org.apache.phoenix.iterate.TableResultIterator.(TableResultIterator.java:79)
>
> at
> org.apache.phoenix.mapreduce.PhoenixRecordReader.initialize(PhoenixRecordReader.java:111)
>
> ... 18 more
>


Re: Re: error when get data from Phoenix 4.5.2 on CDH 5.5.x by spark 1.5

2016-01-08 Thread Josh Mahonin
[Sent this same message from another account, apologies if anyone gets a
double-post]

Thanks for the update. I hadn’t even seen the ‘SPARK_DIST_CLASSPATH’
setting until just now, but I suspect for CDH that might be the only way to
do it.

The reason for the class path errors you see is that the ‘client’ JAR on
its own ships a few extra dependencies (i.e. com.fasterxml.jackson) which
are incompatible with newer versions of Spark. The ‘client-spark’ JAR
attempts to remove those dependencies which would conflict with Spark,
although a more elegant solution will likely come with PHOENIX-2535 (
https://issues.apache.org/jira/browse/PHOENIX-2535)

Re: speed, the spark integration should be just about as fast as the
MapReduce and Pig integration. At the 3T level, your likely bottleneck is
disk IO just to load the data in, although network IO is also a possibility
here as well. Assuming you have a sufficient number of Spark workers with
enough RAM allocated, once the data is loaded into Spark initially,
operations on that dataset should proceed much faster, as much of the data
will be available in RAM vs disk.

Best of luck,

Josh


On Fri, Jan 8, 2016 at 7:34 AM, sac...@outlook.com <sac...@outlook.com>
wrote:

> hi josh
>
> Yes, it is still the same 'No suitable driver' exception.
> And my boss may have solved the problem. The method is surprising, but
> it did succeed. The method is:
>  he add the "export 
> SPARK_DIST_CLASSPATH=$SPARK_DIST_CLASSPATH:/opt/cloudera/parcels/CLABS_PHOENIX/lib/phoenix/phoenix-1.2.0-client.jar"
> in
> spark-env.sh, and restarted Spark. Afterwards, everything seems good.
>   What he added is the "phoenix-1.2.0-client.jar", which previously caused the exception
>  "java.lang.NoSuchMethodError:
> com.fasterxml.jackson.databind.Module$SetupContext.setClassIntrospector"
> It is incredible. I really appreciate what you have done for me. If
> you could tell me why it happens, I would be even happier.
>
> Besides, this afternoon I ran into another problem. I have
> made a table which includes 13 columns, such as (A, B, C, ...), and the
> primary key is (A, B). When I imported all the data into this table, I deleted
> the original data. But now I want another table with the same 13 columns
> but a different primary key, which should be
> (B, C). The data is big, about 3T. Could you tell me how to do this
> quickly? I have tried to do it with Spark, but it does not seem fast.
>
>
>
> Best wishes for  you!
>
>
>
>
>
> --
> sac...@outlook.com
>
>
> *From:* Josh Mahonin <jmaho...@gmail.com>
> *Date:* 2016-01-06 23:02
> *To:* user <user@phoenix.apache.org>
> *Subject:* Re: Re: error when get data from Phoenix 4.5.2 on CDH 5.5.x by
> spark 1.5
> Hi,
>
> Is it still the same 'No suitable driver' exception, or is it something
> else?
>
> Have you tried using the 'yarn-cluster' mode? I've had success with that
> personally, although I don't have any experience on the CDH stack.
>
> Josh
>
>
>
> On Wed, Jan 6, 2016 at 2:59 AM, sac...@outlook.com <sac...@outlook.com>
> wrote:
>
>> hi josh:
>>
>> I did what you said, and now I can run my code in
>> spark-shell --master local without any other confs, but when it comes
>> to 'yarn-client', the error is the same.
>>
>> I will try to give you more information: we have 11 nodes, 3
>> ZKs, 2 masters. I am sure all 11 nodes have the client-spark.jar in
>> the path referred to in spark-defaults.conf.
>> We use CDH 5.5 with Spark 1.5. Our Phoenix version is
>> http://archive.cloudera.com/cloudera-labs/phoenix/parcels/1.2/, which is
>> based on Phoenix 4.5.2.
>> I built the spark-client jar in the following steps:
>> 1. download the base code from
>> https://github.com/cloudera-labs/phoenix
>> 2. apply the PHOENIX-2503.patch manually
>> 3. build
>>
>> It is strange that it did work in local mode but not in
>> yarn-client mode. Waiting for your reply eagerly. Thank you very much.
>>
>>
>> my spark-defaults.conf is
>>
>> spark.authenticate=false
>> spark.dynamicAllocation.enabled=true
>> spark.dynamicAllocation.executorIdleTimeout=60
>> spark.dynamicAllocation.minExecutors=0
>> spark.dynamicAllocation.schedulerBacklogTimeout=1
>> spark.eventLog.dir=hdfs://cdhcluster1/user/spark/applicationHistory
>> spark.eventLog.enabled=t

Re: Re: error when get data from Phoenix 4.5.2 on CDH 5.5.x by spark 1.5

2016-01-06 Thread Josh Mahonin
Hi,

Is it still the same 'No suitable driver' exception, or is it something
else?

Have you tried using the 'yarn-cluster' mode? I've had success with that
personally, although I don't have any experience on the CDH stack.

Josh



On Wed, Jan 6, 2016 at 2:59 AM, sac...@outlook.com <sac...@outlook.com>
wrote:

> hi josh:
>
> I did what you said, and now I can run my code in
> spark-shell --master local without any other confs, but when it comes
> to 'yarn-client', the error is the same.
>
> I will try to give you more information: we have 11 nodes, 3
> ZKs, 2 masters. I am sure all 11 nodes have the client-spark.jar in
> the path referred to in spark-defaults.conf.
> We use CDH 5.5 with Spark 1.5. Our Phoenix version is
> http://archive.cloudera.com/cloudera-labs/phoenix/parcels/1.2/, which is
> based on Phoenix 4.5.2.
> I built the spark-client jar in the following steps:
> 1. download the base code from
> https://github.com/cloudera-labs/phoenix
> 2. apply the PHOENIX-2503.patch manually
> 3. build
>
> It is strange that it did work in local mode but not in
> yarn-client mode. Waiting for your reply eagerly. Thank you very much.
>
>
> my spark-defaults.conf is
>
> spark.authenticate=false
> spark.dynamicAllocation.enabled=true
> spark.dynamicAllocation.executorIdleTimeout=60
> spark.dynamicAllocation.minExecutors=0
> spark.dynamicAllocation.schedulerBacklogTimeout=1
> spark.eventLog.dir=hdfs://cdhcluster1/user/spark/applicationHistory
> spark.eventLog.enabled=true
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> spark.shuffle.service.enabled=true
> spark.shuffle.service.port=7337
>
> spark.executor.extraClassPath=/opt/cloudera/parcels/CLABS_PHOENIX/lib/phoenix/phoenix-1.2.0-client-spark.jar
>
> spark.driver.extraClassPath=/opt/cloudera/parcels/CLABS_PHOENIX/lib/phoenix/phoenix-1.2.0-client-spark.jar
> spark.yarn.historyServer.address=http://cdhmaster1.boloomo.com:18088
>
> spark.yarn.jar=local:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/lib/spark-assembly.jar
>
> spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop/lib/native
>
> spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop/lib/native
>
> spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop/lib/native
> spark.yarn.config.gatewayPath=/opt/cloudera/parcels
> spark.yarn.config.replacementPath={{HADOOP_COMMON_HOME}}/../../..
> spark.master=yarn-client
>
> And here is my code again
> import org.apache.spark.sql.SQLContext
> import org.apache.phoenix.spark._
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.SQLContext
> import org.apache.phoenix.jdbc.PhoenixDriver
> import java.sql.DriverManager
>  DriverManager.registerDriver(new PhoenixDriver)
>  Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
> val pred = s"RID like 'wtwb2%' and TIME between 1325440922 and 1336440922"
> val rdd = sc.phoenixTableAsRDD(
>   "AIS_AREA",
>   Seq("MMSI","LON","LAT","RID"),
>   predicate = Some(pred),
>   zkUrl = Some("cdhzk3.boloomo.com:2181"))
>
> and  i use only spark-shell to run the code this time.
>
>
> --
> sac...@outlook.com
>
>
> *From:* Josh Mahonin <jmaho...@gmail.com>
> *Date:* 2016-01-05 23:41
> *To:* user <user@phoenix.apache.org>
> *Subject:* Re: Re: error when get data from Phoenix 4.5.2 on CDH 5.5.x by
> spark 1.5
> Hi,
>
> The error "java.sql.SQLException: No suitable driver found..." is
> typically thrown when the worker nodes can't find Phoenix on the class path.
>
> I'm not certain that passing those values using '--conf' actually works or
> not with Spark. I tend to set them in my 'spark-defaults.conf' in the Spark
> configuration folder. I think restarting the master and workers may be
> required as well.
>
> Josh
>
> On Tue, Jan 5, 2016 at 5:38 AM, mengfei <sac...@outlook.com> wrote:
>
>> hi josh:
>>
>> Thank you for your advice, and it did work. I built the
>> client-spark jar by applying the patch to the CDH code and it succeeded.
>> Then I ran some code in "local" mode, and the
>> result is correct. But when it comes to "yarn-client" mode, an error
>> happened:
>>
>>
>>  java.

Re: Re: error when get data from Phoenix 4.5.2 on CDH 5.5.x by spark 1.5

2015-12-30 Thread Josh Mahonin
Hi,

Rather than pass in the JARs using the '--jars' flag, what happens if you
include them all in the 'extraClassPath' settings, or the SPARK_CLASSPATH
environment variable? That specific class, PhoenixConfigurationUtil, is in
the phoenix-core JAR, maybe do an 'unzip -l' on it to make sure that it is
in fact included?

Alternatively, if you're familiar with git/patch/maven, you can checkout
the 4.6.0-HBase-1.1 release, apply that patch in PHOENIX-2503, and use the
resulting 'client-spark' JAR in your classpath.
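
Roughly, those steps look something like the following (a sketch only -- the
exact branch/tag name and the output location of the rebuilt JAR may differ):

git clone https://github.com/apache/phoenix.git
cd phoenix
git checkout v4.6.0-HBase-1.1   # or whichever branch/tag matches your release
# download the patch attached to PHOENIX-2503 into this directory, then:
git apply PHOENIX-2503.patch
mvn clean package -DskipTests
# the rebuilt client JAR(s) should end up under phoenix-assembly/target/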

Josh



On Wed, Dec 30, 2015 at 8:24 AM, sac...@outlook.com <sac...@outlook.com>
wrote:

> hi josh:
>
>I did add the jars using --jars, and the 'com.fasterxml.jackson'
> error disappeared, but now a new exception is raised:
>
>
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/phoenix/mapreduce/util/PhoenixConfigurationUtil
>
>
> at 
> org.apache.phoenix.spark.PhoenixRDD.getPhoenixConfiguration(PhoenixRDD.scala:73)
>
>
> at 
> org.apache.phoenix.spark.PhoenixRDD.phoenixConf$lzycompute(PhoenixRDD.scala:39)
>
>
> at 
> org.apache.phoenix.spark.PhoenixRDD.phoenixConf(PhoenixRDD.scala:38)
>
> at org.apache.phoenix.spark.PhoenixRDD.<init>(PhoenixRDD.scala:42)
>
>  .
>
>   My conf is as follows
>
>
> --conf 
> "spark.executor.extraClassPath=/data/public/spark/libs/phoenix-spark-4.6.0-HBase-1.1.jar"
>  \
>
>
> --conf 
> "spark.driver.extraClassPath=/data/public/spark/libs/phoenix-spark-4.6.0-HBase-1.1.jar"
>  \
>
> --jars /data/public/mengfei/spark/libs/guava-12.0.1.jar, \
>
> /data/public/spark/libs/hbase-client-1.1.0.jar, \
>
> /data/public/spark/libs/hbase-common-1.1.0.jar, \
>
> /data/public/spark/libs/hbase-protocol-1.1.0.jar, \
>
> /data/public/spark/libs/hbase-server-1.1.0.jar, \
>
> /data/public/spark/libs/htrace-core-3.1.0-incubating.jar, \
>
> /data/public/spark/libs/phoenix-4.6.0-HBase-1.1-client.jar, \
>
> /data/public/spark/libs/phoenix-core-4.6.0-HBase-1.1.jar, \
>
> /data/public/spark/libs/phoenix-spark-4.6.0-HBase-1.1.jar
>
>
> Thank you
> --
> sac...@outlook.com
>
>
> *From:* Josh Mahonin <jmaho...@gmail.com>
> *Date:* 2015-12-30 00:56
> *To:* user <user@phoenix.apache.org>
> *Subject:* Re: error when get data from Phoenix 4.5.2 on CDH 5.5.x by
> spark 1.5
> Hi,
>
> This issue is fixed with the following patch, and using the resulting
> 'client-spark' JAR after compilation:
> https://issues.apache.org/jira/browse/PHOENIX-2503
>
> As an alternative, you may have some luck also including updated
> com.fasterxml.jackson jackson-databind JARs in your app that are in sync
> with Spark's versions. Unfortunately the client JAR right now is shipping
> fasterxml jars that conflict with the Spark runtime.
>
> Another user has also had success by bundling their own Phoenix
> dependencies, if you want to try that out instead:
>
> http://mail-archives.apache.org/mod_mbox/incubator-phoenix-user/201512.mbox/%3c0f96d592-74d7-431a-b301-015374a6b...@sandia.gov%3E
>
> Josh
>
>
>
> On Tue, Dec 29, 2015 at 9:11 AM, sac...@outlook.com <sac...@outlook.com>
> wrote:
>
>> The error is
>>
>> java.lang.NoSuchMethodError:
>> com.fasterxml.jackson.databind.Module$SetupContext.setClassIntrospector(Lcom/fasterxml/jackson/databind/introspect/ClassIntrospector;)V
>>
>> at
>> com.fasterxml.jackson.module.scala.introspect.ScalaClassIntrospectorModule$$anonfun$1.apply(ScalaClassIntrospector.scala:32)
>>
>> at
>> com.fasterxml.jackson.module.scala.introspect.ScalaClassIntrospectorModule$$anonfun$1.apply(ScalaClassIntrospector.scala:32)
>>
>> at
>> com.fasterxml.jackson.module.scala.JacksonModule$$anonfun$setupModule$1.apply(JacksonModule.scala:47)
>>
>>   …..
>>
>> The scala code is
>>
>> val df = sqlContext.load(
>>
>>   "org.apache.phoenix.spark",
>>
>>   Map("table" -> "AIS ", "zkUrl" -> "cdhzk1.ccco.com:2181")
>>
>> )
>>
>>
>>
> Maybe I found the reason: the Phoenix 4.5.2 for CDH 5.5.x is built with
> Spark 1.4, and CDH 5.5's default Spark version is 1.5.
>
> So what should I do? Rebuild a Phoenix 4.5.2 version with Spark 1.5, or
> change the CDH Spark to 1.4? Apparently these are difficult for me. Could
> someone help me? Thank you very much.
>>
>>
>>
>
>


Re: error when get data from Phoenix 4.5.2 on CDH 5.5.x by spark 1.5

2015-12-29 Thread Josh Mahonin
Hi,

This issue is fixed with the following patch, and using the resulting
'client-spark' JAR after compilation:
https://issues.apache.org/jira/browse/PHOENIX-2503

As an alternative, you may have some luck also including updated
com.fasterxml.jackson jackson-databind JARs in your app that are in sync
with Spark's versions. Unfortunately the client JAR right now is shipping
fasterxml jars that conflict with the Spark runtime.

Another user has also had success by bundling their own Phoenix
dependencies, if you want to try that out instead:
http://mail-archives.apache.org/mod_mbox/incubator-phoenix-user/201512.mbox/%3c0f96d592-74d7-431a-b301-015374a6b...@sandia.gov%3E

Josh



On Tue, Dec 29, 2015 at 9:11 AM, sac...@outlook.com 
wrote:

> The error is
>
> java.lang.NoSuchMethodError:
> com.fasterxml.jackson.databind.Module$SetupContext.setClassIntrospector(Lcom/fasterxml/jackson/databind/introspect/ClassIntrospector;)V
>
> at
> com.fasterxml.jackson.module.scala.introspect.ScalaClassIntrospectorModule$$anonfun$1.apply(ScalaClassIntrospector.scala:32)
>
> at
> com.fasterxml.jackson.module.scala.introspect.ScalaClassIntrospectorModule$$anonfun$1.apply(ScalaClassIntrospector.scala:32)
>
> at
> com.fasterxml.jackson.module.scala.JacksonModule$$anonfun$setupModule$1.apply(JacksonModule.scala:47)
>
>   …..
>
> The scala code is
>
> val df = sqlContext.load(
>
>   "org.apache.phoenix.spark",
>
>   Map("table" -> "AIS ", "zkUrl" -> "cdhzk1.ccco.com:2181")
>
> )
>
>
>
> Maybe I found the reason: the Phoenix 4.5.2 for CDH 5.5.x is built with
> Spark 1.4, and CDH 5.5's default Spark version is 1.5.
>
> So what should I do? Rebuild a Phoenix 4.5.2 version with Spark 1.5, or
> change the CDH Spark to 1.4? Apparently these are difficult for me. Could
> someone help me? Thank you very much.
>
>
>


Re: Does phoenix spark support arbitrary SELECT statement?

2015-12-15 Thread Josh Mahonin
Hi Li,

When using the DataFrame integration, it supports arbitrary SELECT
statements. Column pruning and predicate filtering is pushed down to
Phoenix, and aggregate functions are executed within Spark.

When using RDDs directly, you can specify a table name, columns and an
optional WHERE predicate for basic filtering. Aggregate functions however
are not supported.
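
As a rough sketch of the difference (using the TABLE1 / ID / COL1 schema from
the documentation examples, and an assumed ZooKeeper quorum):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._

val sqlContext = new SQLContext(sc)

// DataFrame route: an arbitrary SELECT works; the filter and column pruning
// below are pushed down to Phoenix, while the count() runs in Spark
val df = sqlContext.read
  .format("org.apache.phoenix.spark")
  .options(Map("table" -> "TABLE1", "zkUrl" -> "phoenix-server:2181"))
  .load()
df.filter(df("ID") > 100).select("ID", "COL1").count()

// RDD route: just a table name, columns and an optional WHERE predicate
val rdd = sc.phoenixTableAsRDD(
  "TABLE1",
  Seq("ID", "COL1"),
  predicate = Some("ID > 100"),
  zkUrl = Some("phoenix-server:2181"))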

The integration tests have a reasonably thorough set of examples on both
DataFrames and RDDs with Phoenix. [1]

Good luck,

Josh

[1]
https://github.com/apache/phoenix/blob/master/phoenix-spark/src/it/scala/org/apache/phoenix/spark/PhoenixSparkIT.scala

On Tue, Dec 15, 2015 at 5:57 PM, Li Gao  wrote:

> Hi community,
>
> Does Phoenix Spark support arbitrary SELECT statements for generating DF
> or RDD?
>
> From this reading: https://phoenix.apache.org/phoenix_spark.html I am not
> sure how to do that.
>
> Thanks,
> Li
>
>


Re: [EXTERNAL] Re: Confusion Installing Phoenix Spark Plugin / Various Errors

2015-12-10 Thread Josh Mahonin
Thanks Jonathan,

I'm making some headway on getting the client library working again. I
thought I saw a mention that you were using pyspark as well, via the
DataFrame support. Are you able to confirm this works as well?

Thanks!

Josh

On Wed, Dec 9, 2015 at 7:51 PM, Cox, Jonathan A <ja...@sandia.gov> wrote:

> Josh,
>
> I added all of those JARs separately to Spark's class paths, and it seems
> to be working fine now.
>
> Thanks a lot for your help!
>
> Sent from my iPhone
>
> On Dec 9, 2015, at 2:30 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
>
> Thanks Jonathan,
>
> I'll follow-up with the issue there. In the meantime, you may have some
> luck just submitting a fat (assembly) JAR to a spark cluster.
>
> If you really want to dive into the nitty-gritty, I'm decomposing the
> client JAR down to the required components that allow for the Spark
> integration to work  (especially excluding the fasterxml JARs). If you were
> to manually assemble the following libraries into the Spark classpath, I
> believe you'll be able to get the spark-shell going:
>
> guava-12.0.1.jar  hbase-common-1.1.0.jar  hbase-server-1.1.0.jar
>  phoenix-core-4.6.0-HBase-1.1.jar  hbase-client-1.1.0.jar
>  hbase-protocol-1.1.0.jar  htrace-core-3.1.0-incubating.jar
>  phoenix-spark-4.6.0-HBase-1.1.jar
>
> Thanks for the report.
>
> Josh
>
> On Wed, Dec 9, 2015 at 4:00 PM, Cox, Jonathan A <ja...@sandia.gov> wrote:
>
>> Thanks, Josh. I submitted the issue, which can be found at:
>> https://issues.apache.org/jira/browse/PHOENIX-2503
>>
>>
>>
>> Multiple Java NoClass/Method Errors with Spark and Phoenix
>>
>>
>>
>> *From:* Josh Mahonin [mailto:jmaho...@gmail.com]
>> *Sent:* Wednesday, December 09, 2015 1:15 PM
>>
>> *To:* user@phoenix.apache.org
>> *Subject:* Re: [EXTERNAL] Re: Confusion Installing Phoenix Spark Plugin
>> / Various Errors
>>
>>
>>
>> Hi Jonathan,
>>
>>
>>
>> Thanks, I'm digging into this as we speak. That SPARK-8332 issue looks
>> like the same issue, and to quote one of the comments in that issue
>> 'Classpath hell is hell'.
>>
>>
>>
>> What is interesting is that the unit tests in Phoenix 4.6.0 successfully
>> run against Spark 1.5.2 [1], so I wonder if this issue is specific to
>> the spark-shell. You may have some success compiling your app as an
>> assembly JAR and submitting it to a Spark cluster instead.
>>
>>
>>
>> Could you do me a favour and file a JIRA ticket for this, and copy all
>> the relevant information you've posted there?
>>
>>
>>
>> Thanks!
>>
>> Josh
>>
>> [1]
>> https://github.com/apache/phoenix/blob/master/phoenix-spark/src/it/scala/org/apache/phoenix/spark/PhoenixSparkIT.scala
>>
>>
>>
>> On Wed, Dec 9, 2015 at 2:52 PM, Cox, Jonathan A <ja...@sandia.gov> wrote:
>>
>> Josh,
>>
>>
>>
>> I’d like to give you a little more information regarding this error. It
>> looks like when I add the Phoenix Client JAR to Spark, it causes Spark to
>> fail:
>>
>> spark.executor.extraClassPath
>> /usr/local/phoenix/phoenix-4.6.0-HBase-1.1-client.jar
>>
>> spark.driver.extraClassPath
>> /usr/local/phoenix/phoenix-4.6.0-HBase-1.1-client.jar
>>
>>
>>
>> After adding this JAR, I get the following error when excuting the
>> following command:
>>
>> scala> val textFile = sc.textFile("README.md")
>>
>> java.lang.NoSuchMethodError:
>> com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
>>
>> at
>> com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.(ScalaNumberDeserializersModule.scala:49)
>>
>> at
>> com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.(ScalaNumberDeserializersModule.scala)
>>
>>
>>
>> As you can see, adding this phoenix JAR is breaking other Spark
>> functionality for me. My naïve guess is that there is a different version
>> of the Jackson FasterXML classes packaged inside
>> phoenix-4.6.0-HBase-1.1-client.jar that is breaking Spark.
>>
>>
>>
>> Have you seen anything like this before?
>>
>>
>>
>> Regards,
>>
>> Jonathan
>>
>>
>>
>> *From:* Cox, Jonathan A [mailto:ja...@sandia.gov]
>> *Sent:* Wednesday, December 09, 2015 11:58 AM
>> *To:* user@phoenix.apache.org
>> *Subject:* [EXTERNAL] Re: Confusion Installing Phoenix Spark Plugin /
>> Various Errors

Re: phoenix-spark and pyspark

2015-12-10 Thread Josh Mahonin
Hey Nick,

I think this used to work, and will again once PHOENIX-2503 gets resolved.
With the Spark DataFrame support, all the necessary glue is there for
Phoenix and pyspark to play nice. With that client JAR (or by overriding
the com.fasterxml.jackson JARS), you can do something like:

df = sqlContext.read \
  .format("org.apache.phoenix.spark") \
  .option("table", "TABLE1") \
  .option("zkUrl", "localhost:63512") \
  .load()

And

df.write \
  .format("org.apache.phoenix.spark") \
  .mode("overwrite") \
  .option("table", "TABLE1") \
  .option("zkUrl", "localhost:63512") \
  .save()


Yes, this should be added to the documentation. I hadn't actually tried
this till just now. :)

On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk  wrote:

> Heya,
>
> Has anyone any experience using phoenix-spark integration from pyspark
> instead of scala? Folks prefer python around here...
>
> I did find this example [0] of using HBaseOutputFormat from pyspark,
> haven't tried extending it for phoenix. Maybe someone with more experience
> in pyspark knows better? Would be a great addition to our documentation.
>
> Thanks,
> Nick
>
> [0]:
> https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
>


Re: [EXTERNAL] Re: Confusion Installing Phoenix Spark Plugin / Various Errors

2015-12-09 Thread Josh Mahonin
Hi Jonathan,

Thanks for the information. If you're able, could you also try the
'SPARK_CLASSPATH' environment variable instead of the spark-defaults.conf
setting, and let us know if that works? Also the exact Spark package you're
using would be helpful as well (from source, prebuilt for 2.6+, 2.4+, CDH,
etc.)

Thanks,

Josh

On Wed, Dec 9, 2015 at 12:08 AM, Cox, Jonathan A <ja...@sandia.gov> wrote:

> Alright, I reproduced what you did exactly, and it now works. The problem
> is that the Phoenix client JAR is not working correctly with the Spark
> builds that include Hadoop.
>
> When I downloaded the Spark build with user provided Hadoop, and also
> installed Hadoop manually, Spark works with Phoenix correctly!
>
> Thank you much,
> Jonathan
>
> Sent from my iPhone
>
> On Dec 8, 2015, at 8:54 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
>
> Hi Jonathan,
>
> Spark only needs the client JAR. It contains all the other Phoenix
> dependencies as well.
>
> I'm not sure exactly what the issue you're seeing is. I just downloaded
> and extracted fresh copies of Spark 1.5.2 (pre-built with user-provided
> Hadoop), and the latest Phoenix 4.6.0 binary release.
>
> I copied the 'phoenix-4.6.0-HBase-1.1-client.jar' to /tmp and created a
> 'spark-defaults.conf' in the 'conf' folder of the Spark install with the
> following:
>
> spark.executor.extraClassPath /tmp/phoenix-4.6.0-HBase-1.1-client.jar
> spark.driver.extraClassPath /tmp/phoenix-4.6.0-HBase-1.1-client.jar
>
> I then launched the 'spark-shell', and was able to execute:
>
> import org.apache.phoenix.spark._
>
> From there, you should be able to use the methods provided by the
> phoenix-spark integration within the Spark shell.
>
> Good luck,
>
> Josh
>
> On Tue, Dec 8, 2015 at 8:51 PM, Cox, Jonathan A <ja...@sandia.gov> wrote:
>
>> I am trying to get Spark up and running with Phoenix, but the
>> installation instructions are not clear to me, or there is something else
>> wrong. I’m using Spark 1.5.2, HBase 1.1.2 and Phoenix 4.6.0 with a
>> standalone install (no HDFS or cluster) with Debian Linux 8 (Jessie) x64.
>> I’m also using Java 1.8.0_40.
>>
>>
>>
>> The instructions state:
>>
>> 1.   Ensure that all requisite Phoenix / HBase platform dependencies
>> are available on the classpath for the Spark executors and drivers
>>
>> 2.   One method is to add the phoenix-4.4.0-client.jar to
>> ‘SPARK_CLASSPATH’ in spark-env.sh, or setting both
>> ‘spark.executor.extraClassPath’ and ‘spark.driver.extraClassPath’ in
>> spark-defaults.conf
>>
>>
>>
>> *First off, what are “all requisite Phoenix / HBase platform
>> dependencies”?* #2 suggests that all I need to do is add
>>  ‘phoenix-4.6.0-HBase-1.1-client.jar’ to Spark’s class path. But what about
>> ‘phoenix-spark-4.6.0-HBase-1.1.jar’ or ‘phoenix-core-4.6.0-HBase-1.1.jar’?
>> Do either of these (or anything else) need to be added to Spark’s class
>> path?
>>
>>
>>
>> Secondly, if I follow the instructions exactly, and add only
>> ‘phoenix-4.6.0-HBase-1.1-client.jar’ to ‘spark-defaults.conf’:
>>
>> spark.executor.extraClassPath
>> /usr/local/phoenix/phoenix-4.6.0-HBase-1.1-client.jar
>>
>> spark.driver.extraClassPath
>> /usr/local/phoenix/phoenix-4.6.0-HBase-1.1-client.jar
>>
>> Then I get the following error when starting the interactive Spark shell
>> with ‘spark-shell’:
>>
>> 15/12/08 18:38:05 WARN ObjectStore: Version information not found in
>> metastore. hive.metastore.schema.verification is not enabled so recording
>> the schema version 1.2.0
>>
>> 15/12/08 18:38:05 WARN ObjectStore: Failed to get database default,
>> returning NoSuchObjectException
>>
>> 15/12/08 18:38:05 WARN Hive: Failed to access metastore. This class
>> should not accessed in runtime.
>>
>> org.apache.hadoop.hive.ql.metadata.HiveException:
>> java.lang.RuntimeException: Unable to instantiate
>> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>>
>> at
>> org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
>>
>> …
>>
>>
>>
>> :10: error: not found: value sqlContext
>>
>>import sqlContext.implicits._
>>
>>   ^
>>
>> :10: error: not found: value sqlContext
>>
>>import sqlContext.sql
>>
>>
>>
>> On the other hand, if I include all three of the aforementioned JARs, I
>> get the same error. However, *if I include only the
>> ‘phoenix-spark-

Re: [EXTERNAL] Re: Confusion Installing Phoenix Spark Plugin / Various Errors

2015-12-09 Thread Josh Mahonin
Thanks Jonathan,

I'll follow-up with the issue there. In the meantime, you may have some
luck just submitting a fat (assembly) JAR to a spark cluster.

If you really want to dive into the nitty-gritty, I'm decomposing the
client JAR down to the required components that allow for the Spark
integration to work  (especially excluding the fasterxml JARs). If you were
to manually assemble the following libraries into the Spark classpath, I
believe you'll be able to get the spark-shell going:

guava-12.0.1.jar  hbase-common-1.1.0.jar  hbase-server-1.1.0.jar
 phoenix-core-4.6.0-HBase-1.1.jar  hbase-client-1.1.0.jar
 hbase-protocol-1.1.0.jar  htrace-core-3.1.0-incubating.jar
 phoenix-spark-4.6.0-HBase-1.1.jar
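
In spark-defaults.conf form that would be something along these lines (the
/path/to directory is a placeholder; entries in extraClassPath are separated
by colons, and the driver setting takes the same value):

spark.executor.extraClassPath /path/to/phoenix-core-4.6.0-HBase-1.1.jar:/path/to/phoenix-spark-4.6.0-HBase-1.1.jar:/path/to/hbase-client-1.1.0.jar:/path/to/hbase-common-1.1.0.jar:/path/to/hbase-server-1.1.0.jar:/path/to/hbase-protocol-1.1.0.jar:/path/to/htrace-core-3.1.0-incubating.jar:/path/to/guava-12.0.1.jar
spark.driver.extraClassPath /path/to/phoenix-core-4.6.0-HBase-1.1.jar:/path/to/phoenix-spark-4.6.0-HBase-1.1.jar:/path/to/hbase-client-1.1.0.jar:/path/to/hbase-common-1.1.0.jar:/path/to/hbase-server-1.1.0.jar:/path/to/hbase-protocol-1.1.0.jar:/path/to/htrace-core-3.1.0-incubating.jar:/path/to/guava-12.0.1.jar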

Thanks for the report.

Josh

On Wed, Dec 9, 2015 at 4:00 PM, Cox, Jonathan A <ja...@sandia.gov> wrote:

> Thanks, Josh. I submitted the issue, which can be found at:
> https://issues.apache.org/jira/browse/PHOENIX-2503
>
>
>
> Multiple Java NoClass/Method Errors with Spark and Phoenix
>
>
>
> *From:* Josh Mahonin [mailto:jmaho...@gmail.com]
> *Sent:* Wednesday, December 09, 2015 1:15 PM
>
> *To:* user@phoenix.apache.org
> *Subject:* Re: [EXTERNAL] Re: Confusion Installing Phoenix Spark Plugin /
> Various Errors
>
>
>
> Hi Jonathan,
>
>
>
> Thanks, I'm digging into this as we speak. That SPARK-8332 issue looks
> like the same issue, and to quote one of the comments in that issue
> 'Classpath hell is hell'.
>
>
>
> What is interesting is that the unit tests in Phoenix 4.6.0 successfully
> run against Spark 1.5.2 [1], so I wonder if this issue is specific to
> the spark-shell. You may have some success compiling your app as an
> assembly JAR and submitting it to a Spark cluster instead.
>
>
>
> Could you do me a favour and file a JIRA ticket for this, and copy all the
> relevant information you've posted there?
>
>
>
> Thanks!
>
> Josh
>
> [1]
> https://github.com/apache/phoenix/blob/master/phoenix-spark/src/it/scala/org/apache/phoenix/spark/PhoenixSparkIT.scala
>
>
>
> On Wed, Dec 9, 2015 at 2:52 PM, Cox, Jonathan A <ja...@sandia.gov> wrote:
>
> Josh,
>
>
>
> I’d like to give you a little more information regarding this error. It
> looks like when I add the Phoenix Client JAR to Spark, it causes Spark to
> fail:
>
> spark.executor.extraClassPath
> /usr/local/phoenix/phoenix-4.6.0-HBase-1.1-client.jar
>
> spark.driver.extraClassPath
> /usr/local/phoenix/phoenix-4.6.0-HBase-1.1-client.jar
>
>
>
> After adding this JAR, I get the following error when excuting the
> following command:
>
> scala> val textFile = sc.textFile("README.md")
>
> java.lang.NoSuchMethodError:
> com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
>
> at
> com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.(ScalaNumberDeserializersModule.scala:49)
>
> at
> com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.(ScalaNumberDeserializersModule.scala)
>
>
>
> As you can see, adding this phoenix JAR is breaking other Spark
> functionality for me. My naïve guess is that there is a different version
> of the Jackson FasterXML classes packaged inside
> phoenix-4.6.0-HBase-1.1-client.jar that is breaking Spark.
>
>
>
> Have you seen anything like this before?
>
>
>
> Regards,
>
> Jonathan
>
>
>
> *From:* Cox, Jonathan A [mailto:ja...@sandia.gov]
> *Sent:* Wednesday, December 09, 2015 11:58 AM
> *To:* user@phoenix.apache.org
> *Subject:* [EXTERNAL] Re: Confusion Installing Phoenix Spark Plugin /
> Various Errors
>
>
>
> Josh,
>
>
>
> So using user provided Hadoop 2.6 solved the immediate Phoenix / Spark
> integration problem I was having. However, I now have another problem,
> which seems to be similar to:
>
> https://issues.apache.org/jira/browse/SPARK-8332
>
> java.lang.NoSuchMethodError:
> com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer
>
>
>
> I’m getting this error when executing the simple example in the Phoenix /
> Spark Plugin page:
>
> Spark context available as sc.
>
> 15/12/09 11:51:02 INFO repl.SparkILoop: Created sql context..
>
> SQL context available as sqlContext.
>
>
>
> scala> val df = sqlContext.load(
>
>  |   "org.apache.phoenix.spark",
>
>  |   Map("table" -> "TABLE1", "zkUrl" -> "phoenix-server:2181")
>
>  | )
>
> warning: there were 1 deprecation warning(s); re-run with -deprecation for
> details
>
> java.lang.NoSuchMethodError:
> com.fasterxml.jackson.module.scala.deser.BigDecimalDese

Re: [EXTERNAL] Re: Confusion Installing Phoenix Spark Plugin / Various Errors

2015-12-09 Thread Josh Mahonin
Hi Jonathan,

Thanks, I'm digging into this as we speak. That SPARK-8332 issue looks like
the same issue, and to quote one of the comments in that issue 'Classpath
hell is hell'.

What is interesting is that the unit tests in Phoenix 4.6.0 successfully
run against Spark 1.5.2 [1], so I wonder if this issue is specific to
the spark-shell. You may have some success compiling your app as an
assembly JAR and submitting it to a Spark cluster instead.

Could you do me a favour and file a JIRA ticket for this, and copy all the
relevant information you've posted there?

Thanks!

Josh

[1]
https://github.com/apache/phoenix/blob/master/phoenix-spark/src/it/scala/org/apache/phoenix/spark/PhoenixSparkIT.scala

On Wed, Dec 9, 2015 at 2:52 PM, Cox, Jonathan A <ja...@sandia.gov> wrote:

> Josh,
>
>
>
> I’d like to give you a little more information regarding this error. It
> looks like when I add the Phoenix Client JAR to Spark, it causes Spark to
> fail:
>
> spark.executor.extraClassPath
> /usr/local/phoenix/phoenix-4.6.0-HBase-1.1-client.jar
>
> spark.driver.extraClassPath
> /usr/local/phoenix/phoenix-4.6.0-HBase-1.1-client.jar
>
>
>
> After adding this JAR, I get the following error when excuting the
> following command:
>
> scala> val textFile = sc.textFile("README.md")
>
> java.lang.NoSuchMethodError:
> com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
>
> at
> com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.(ScalaNumberDeserializersModule.scala:49)
>
> at
> com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.(ScalaNumberDeserializersModule.scala)
>
>
>
> As you can see, adding this phoenix JAR is breaking other Spark
> functionality for me. My naïve guess is that there is a different version
> of the Jackson FasterXML classes packaged inside
> phoenix-4.6.0-HBase-1.1-client.jar that is breaking Spark.
>
>
>
> Have you seen anything like this before?
>
>
>
> Regards,
>
> Jonathan
>
>
>
> *From:* Cox, Jonathan A [mailto:ja...@sandia.gov]
> *Sent:* Wednesday, December 09, 2015 11:58 AM
> *To:* user@phoenix.apache.org
> *Subject:* [EXTERNAL] Re: Confusion Installing Phoenix Spark Plugin /
> Various Errors
>
>
>
> Josh,
>
>
>
> So using user provided Hadoop 2.6 solved the immediate Phoenix / Spark
> integration problem I was having. However, I now have another problem,
> which seems to be similar to:
>
> https://issues.apache.org/jira/browse/SPARK-8332
>
> java.lang.NoSuchMethodError:
> com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer
>
>
>
> I’m getting this error when executing the simple example in the Phoenix /
> Spark Plugin page:
>
> Spark context available as sc.
>
> 15/12/09 11:51:02 INFO repl.SparkILoop: Created sql context..
>
> SQL context available as sqlContext.
>
>
>
> scala> val df = sqlContext.load(
>
>  |   "org.apache.phoenix.spark",
>
>  |   Map("table" -> "TABLE1", "zkUrl" -> "phoenix-server:2181")
>
>  | )
>
> warning: there were 1 deprecation warning(s); re-run with -deprecation for
> details
>
> java.lang.NoSuchMethodError:
> com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
>
>
>
> I did try upgrading the Hadoop Jackson JARs from 2.2.3 to 2.4.3, as some
> suggested in the link above, and including them in Spark’s classpath.
> However, the error was the same.
>
>
>
> *From:* Josh Mahonin [mailto:jmaho...@gmail.com <jmaho...@gmail.com>]
> *Sent:* Wednesday, December 09, 2015 11:21 AM
> *To:* user@phoenix.apache.org
> *Subject:* Re: [EXTERNAL] Re: Confusion Installing Phoenix Spark Plugin /
> Various Errors
>
>
>
> Definitely. I'd like to dig into what the root cause is, but it might be
> optimistic to think I'll be able to get to that any time soon.
>
>
>
> I'll try get the docs updated today.
>
>
>
> On Wed, Dec 9, 2015 at 1:09 PM, James Taylor <jamestay...@apache.org>
> wrote:
>
> Would it make sense to tweak the Spark installation instructions slightly
> with this information, Josh?
>
>
>
> On Wed, Dec 9, 2015 at 9:11 AM, Cox, Jonathan A <ja...@sandia.gov> wrote:
>
> Josh,
>
>
>
> Previously, I was using the SPARK_CLASSPATH, but then read that it was
> deprecated and switched to the spark-defaults.conf file. The result was the
> same.
>
>
>
> Also, I was using ‘spark-1.5.2-bin-hadoop2.6.tgz’, which includes some
> Hadoop 2.6 JARs. This caused the trouble. However, by separately
> downloading Had

Re: Confusion Installing Phoenix Spark Plugin / Various Errors

2015-12-08 Thread Josh Mahonin
Hi Jonathan,

Spark only needs the client JAR. It contains all the other Phoenix
dependencies as well.

I'm not sure exactly what the issue you're seeing is. I just downloaded and
extracted fresh copies of Spark 1.5.2 (pre-built with user-provided
Hadoop), and the latest Phoenix 4.6.0 binary release.

I copied the 'phoenix-4.6.0-HBase-1.1-client.jar' to /tmp and created a
'spark-defaults.conf' in the 'conf' folder of the Spark install with the
following:

spark.executor.extraClassPath /tmp/phoenix-4.6.0-HBase-1.1-client.jar
spark.driver.extraClassPath /tmp/phoenix-4.6.0-HBase-1.1-client.jar

I then launched the 'spark-shell', and was able to execute:

import org.apache.phoenix.spark._

From there, you should be able to use the methods provided by the
phoenix-spark integration within the Spark shell.

Good luck,

Josh

On Tue, Dec 8, 2015 at 8:51 PM, Cox, Jonathan A  wrote:

> I am trying to get Spark up and running with Phoenix, but the installation
> instructions are not clear to me, or there is something else wrong. I’m
> using Spark 1.5.2, HBase 1.1.2 and Phoenix 4.6.0 with a standalone install
> (no HDFS or cluster) with Debian Linux 8 (Jessie) x64. I’m also using Java
> 1.8.0_40.
>
>
>
> The instructions state:
>
> 1.   Ensure that all requisite Phoenix / HBase platform dependencies
> are available on the classpath for the Spark executors and drivers
>
> 2.   One method is to add the phoenix-4.4.0-client.jar to
> ‘SPARK_CLASSPATH’ in spark-env.sh, or setting both
> ‘spark.executor.extraClassPath’ and ‘spark.driver.extraClassPath’ in
> spark-defaults.conf
>
>
>
> *First off, what are “all requisite Phoenix / HBase platform
> dependencies”?* #2 suggests that all I need to do is add
>  ‘phoenix-4.6.0-HBase-1.1-client.jar’ to Spark’s class path. But what about
> ‘phoenix-spark-4.6.0-HBase-1.1.jar’ or ‘phoenix-core-4.6.0-HBase-1.1.jar’?
> Do either of these (or anything else) need to be added to Spark’s class
> path?
>
>
>
> Secondly, if I follow the instructions exactly, and add only
> ‘phoenix-4.6.0-HBase-1.1-client.jar’ to ‘spark-defaults.conf’:
>
> spark.executor.extraClassPath
> /usr/local/phoenix/phoenix-4.6.0-HBase-1.1-client.jar
>
> spark.driver.extraClassPath
> /usr/local/phoenix/phoenix-4.6.0-HBase-1.1-client.jar
>
> Then I get the following error when starting the interactive Spark shell
> with ‘spark-shell’:
>
> 15/12/08 18:38:05 WARN ObjectStore: Version information not found in
> metastore. hive.metastore.schema.verification is not enabled so recording
> the schema version 1.2.0
>
> 15/12/08 18:38:05 WARN ObjectStore: Failed to get database default,
> returning NoSuchObjectException
>
> 15/12/08 18:38:05 WARN Hive: Failed to access metastore. This class should
> not accessed in runtime.
>
> org.apache.hadoop.hive.ql.metadata.HiveException:
> java.lang.RuntimeException: Unable to instantiate
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>
> at
> org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
>
> …
>
>
>
> :10: error: not found: value sqlContext
>
>import sqlContext.implicits._
>
>   ^
>
> :10: error: not found: value sqlContext
>
>import sqlContext.sql
>
>
>
> On the other hand, if I include all three of the aforementioned JARs, I
> get the same error. However, *if I include only the
> ‘phoenix-spark-4.6.0-HBase-1.1.jar’*, spark-shell seems so launch without
> error. Nevertheless, if I then try the simple tutorial commands in
> spark-shell, I get the following:
>
> *Spark output:* SQL context available as sqlContext.
>
>
>
> *scala >>* import org.apache.spark.SparkContext
>
> import org.apache.spark.sql.SQLContext
>
> import org.apache.phoenix.spark._
>
>
>
> val sqlContext = new SQLContext(sc)
>
>
>
> val df =
> sqlContext.load("org.apache.phoenix.spark", Map("table" -> "TABLE1",
> "zkUrl" -> "phoenix-server:2181")
>
>
>
> *Spark error:*
>
> *java.lang.NoClassDefFoundError:
> org/apache/hadoop/hbase/HBaseConfiguration*
>
> at
> org.apache.phoenix.spark.PhoenixRDD.getPhoenixConfiguration(PhoenixRDD.scala:71)
>
> at
> org.apache.phoenix.spark.PhoenixRDD.phoenixConf$lzycompute(PhoenixRDD.scala:39)
>
> at
> org.apache.phoenix.spark.PhoenixRDD.phoenixConf(PhoenixRDD.scala:38)
>
> at
> org.apache.phoenix.spark.PhoenixRDD.<init>(PhoenixRDD.scala:42)
>
> at
> org.apache.phoenix.spark.PhoenixRelation.schema(PhoenixRelation.scala:50)
>
> at
> org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
>
> at
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:120)
>
>
>
> This final error seems similar to the one in mailing list post Phoenix-spark
> : NoClassDefFoundError: HBaseConfiguration
> 

Re: Problem with arrays in phoenix-spark

2015-11-30 Thread Josh Mahonin
Hi David,

Thanks for the bug report and the proposed patch. Please file a JIRA and
we'll take the discussion there.

Josh

On Mon, Nov 30, 2015 at 1:01 PM, Dawid Wysakowicz <
wysakowicz.da...@gmail.com> wrote:

> Hi,
>
> I've recently found some behaviour that I found buggy when working with
> phoenix-spark and arrays.
>
> Take a look at those unit tests:
>
>   test("Can save arrays from custom dataframes back to phoenix") {
> val dataSet = List(Row(2L, Array("String1", "String2", "String3")))
>
> val sqlContext = new SQLContext(sc)
>
> val schema = StructType(
> Seq(StructField("ID", LongType, nullable = false),
> StructField("VCARRAY", ArrayType(StringType
>
> val rowRDD = sc.parallelize(dataSet)
>
> // Apply the schema to the RDD.
> val df = sqlContext.createDataFrame(rowRDD, schema)
>
> df.write
>   .format("org.apache.phoenix.spark")
>   .options(Map("table" -> "ARRAY_TEST_TABLE", "zkUrl" ->
> quorumAddress))
>   .mode(SaveMode.Overwrite)
>   .save()
>   }
>
>   test("Can save arrays of AnyVal type back to phoenix") {
> val dataSet = List((2L, Array(1, 2, 3), Array(1L, 2L, 3L)))
>
> sc
>   .parallelize(dataSet)
>   .saveToPhoenix(
> "ARRAY_ANYVAL_TEST_TABLE",
> Seq("ID", "INTARRAY", "BIGINTARRAY"),
> zkUrl = Some(quorumAddress)
>   )
>
> // Load the results back
> val stmt = conn.createStatement()
> val rs = stmt.executeQuery("SELECT INTARRAY, BIGINTARRAY FROM
> ARRAY_ANYVAL_TEST_TABLE WHERE ID = 2")
> rs.next()
> val intArray = rs.getArray(1).getArray().asInstanceOf[Array[Int]]
> val longArray = rs.getArray(2).getArray().asInstanceOf[Array[Long]]
>
> // Verify the arrays are equal
> intArray shouldEqual dataSet(0)._2
> longArray shouldEqual dataSet(0)._3
>   }
>
> Both fail with some ClassCastExceptions.
>
> In attached patch I've proposed a solution. The tricky part is with
> Array[Byte] as this would be same for both VARBINARY and TINYINT[].
>
> Let me know If I should create an issue for this, and if my solution
> satisfies you.
>
> Regards
> Dawid Wysakowicz
>
>
>


RE: Phoenix-spark : NoClassDefFoundError: HBaseConfiguration

2015-11-06 Thread Josh Mahonin
Have you tried setting the SPARK_CLASSPATH or spark.driver.extraClassPath / 
spark.executor.extraClassPath as specified here?
https://phoenix.apache.org/phoenix_spark.html

Spark treats the JARs passed in with '--jars', and the class path, a little 
differently.
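
Concretely, either of these should do it (the path below is taken from your
spark-shell command; adjust as needed for your install):

# conf/spark-env.sh
export SPARK_CLASSPATH=/opt/phoenix/phoenix-4.6.0-HBase-1.1-bin/phoenix-4.6.0-HBase-1.1-client.jar

# or conf/spark-defaults.conf
spark.executor.extraClassPath /opt/phoenix/phoenix-4.6.0-HBase-1.1-bin/phoenix-4.6.0-HBase-1.1-client.jar
spark.driver.extraClassPath /opt/phoenix/phoenix-4.6.0-HBase-1.1-bin/phoenix-4.6.0-HBase-1.1-client.jar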

From: vishnu prasad [vishnu...@gmail.com]
Sent: Tuesday, November 03, 2015 4:50 AM
To: user@phoenix.apache.org
Subject: Phoenix-spark : NoClassDefFoundError: HBaseConfiguration

I'm using phoenix-spark to load data from phoenix-hbase

spark-version : 1.5.1
Hbase Version : 1.1.2
Phoenix Version : phoenix-4.6.0-HBase-1.1

I'm running spark-shell in standalone mode with the following jars added

  *   phoenix-spark-4.6.0-HBase-1.1.jar
  *   phoenix-4.6.0-HBase-1.1-client.jar
  *   phoenix-core-4.6.0-HBase-1.1.jar

spark-1.5.1-bin-hadoop2.6/bin/spark-shell --jars 
/opt/phoenix/phoenix-4.6.0-HBase-1.1-bin/phoenix-spark-4.6.0-HBase-1.1.jar 
/opt/phoenix/phoenix-4.6.0-HBase-1.1-bin/phoenix-4.6.0-HBase-1.1-client.jar 
/opt/phoenix/phoenix-4.6.0-HBase-1.1-bin/phoenix-core-4.6.0-HBase-1.1.jar


import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._
val sqlcontext = new SQLContext(sc)
val df = sqlcontext.load("org.apache.phoenix.spark",Map("table" -> "CCPP", 
"zkUrl" -> "localhost:2181"))

Gives me the following stack-strace
warning: there were 1 deprecation warning(s); re-run with -deprecation for 
details
java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
at 
org.apache.phoenix.spark.PhoenixRDD.getPhoenixConfiguration(PhoenixRDD.scala:71)
at 
org.apache.phoenix.spark.PhoenixRDD.phoenixConf$lzycompute(PhoenixRDD.scala:39)
at org.apache.phoenix.spark.PhoenixRDD.phoenixConf(PhoenixRDD.scala:38)
at org.apache.phoenix.spark.PhoenixRDD.<init>(PhoenixRDD.scala:42)
at org.apache.phoenix.spark.PhoenixRelation.schema(PhoenixRelation.scala:50)
at 
org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:31)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:120)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1203)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
at $iwC$$iwC$$iwC.<init>(<console>:37)
at $iwC$$iwC.<init>(<console>:39)
at $iwC.<init>(<console>:41)
at <init>(<console>:43)
at .<init>(<console>:47)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at 

Re: integration Phoenix and Spark

2015-09-29 Thread Josh Mahonin
Make sure to double check your imports. Note the following from 
https://phoenix.apache.org/phoenix_spark.html


import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._

There's also a sample repository here: 
https://github.com/jmahonin/spark-graphx-phoenix
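
For reference, a corrected skeleton of the snippet quoted below would look
roughly like this (the table name, columns and ZK URL follow the standard
save example from the docs and are placeholders for your own):

import org.apache.spark.SparkContext
import org.apache.phoenix.spark._

val sc = new SparkContext("local", "phoenix-test")
val dataSet = List((1L, "1", 1), (2L, "2", 2), (3L, "3", 3))

// saveToPhoenix is brought in by the org.apache.phoenix.spark._ import
sc.parallelize(dataSet).saveToPhoenix(
  "OUTPUT_TEST_TABLE",
  Seq("ID", "COL1", "COL2"),
  zkUrl = Some("localhost:2181"))

Note that it needs to be compiled and run with Spark and the Phoenix client
JAR on the classpath (e.g. via spark-shell or spark-submit), not with the
plain scala command.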

From: Hardika Catur Sapta
Reply-To: "user@phoenix.apache.org"
Date: Tuesday, September 29, 2015 at 5:28 AM
To: "user@phoenix.apache.org"
Subject: Re: integration Phoenix and Spark

/spark/Project Spark$ scala SavingPhoenix.scala
/home/hduser/spark/Project Spark/SavingPhoenix.scala:1: error: object spark is 
not a member of package phoenix.org.apache
import phoenix.org.apache.spark.SparkContext
  ^
/home/hduser/spark/Project Spark/SavingPhoenix.scala:4: error: not found: type 
SparkContext
val sc = new SparkContext("local", "phoenix-test")
 ^
two errors found


2015-09-29 16:20 GMT+07:00 Konstantinos Kougios 
>:
Hi,

Just to add that, at least for hadoop-2.7.1 and phoenix 4.5.2-HBase-1.1, hadoop 
guava lib has to be patched to 14.0.1 (under hadoop/share/hadoop/common/lib) 
otherwise spark tasks might fail due to missing guava methods.

Cheers


On 29/09/15 10:17, Hardika Catur Sapta wrote:
Spark setup

  1.  Ensure that all requisite Phoenix / HBase platform dependencies are 
available on the classpath for the Spark executors and drivers

  2.  One method is to add the phoenix-4.4.0-client.jar to ‘SPARK_CLASSPATH’ in 
spark-env.sh, or setting both ‘spark.executor.

  3.  To help your IDE, you may want to add the following ‘provided’ dependency


Sorry for my bad English.

What I mean is: how exactly do I do steps 2 and 3?

Please explain step by step.


Thanks.





Re: Spark Plugin Exception - java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row

2015-09-23 Thread Josh Mahonin
Hi Babar,

Can you file a JIRA for this? I suspect this is something to do with the Spark 
1.5 data frame API data structures, perhaps they've gone and changed them again!

Can you try with previous Spark versions to see if there's a difference? Also, 
you may have luck interfacing with the RDDs directly instead of the data frames.

Thanks!

Josh

From: Babar Tareen
Reply-To: "user@phoenix.apache.org"
Date: Tuesday, September 22, 2015 at 5:47 PM
To: "user@phoenix.apache.org"
Subject: Spark Plugin Exception - java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to 
org.apache.spark.sql.Row

Hi,

I am trying to run the spark plugin DataFrame sample code available here 
(https://phoenix.apache.org/phoenix_spark.html) and getting following 
exception.  I am running the code against hbase-1.1.1, spark 1.5.0 and phoenix  
4.5.2. HBase is running in standalone mode, locally on OS X.  Any ideas what 
might be causing this exception?


java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to 
org.apache.spark.sql.Row
at org.apache.spark.sql.SQLContext$$anonfun$7.apply(SQLContext.scala:439) 
~[spark-sql_2.11-1.5.0.jar:1.5.0]
at scala.collection.Iterator$$anon$11.next(Iterator.scala:363) 
~[scala-library-2.11.4.jar:na]
at scala.collection.Iterator$$anon$11.next(Iterator.scala:363) 
~[scala-library-2.11.4.jar:na]
at scala.collection.Iterator$$anon$11.next(Iterator.scala:363) 
~[scala-library-2.11.4.jar:na]
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:366)
 ~[spark-sql_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
 ~[spark-sql_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
 ~[spark-sql_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
 ~[spark-sql_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
 ~[spark-sql_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
 ~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.scheduler.Task.run(Task.scala:88) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
[na:1.8.0_45]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
[na:1.8.0_45]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]

Thanks,
Babar


Re: Spark Plugin Exception - java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row

2015-09-23 Thread Josh Mahonin
I've got a patch attached to the ticket that I think should fix your issue.

If you're able to try it out and let us know how it goes, it'd be much 
appreciated.

From: Babar Tareen
Reply-To: "user@phoenix.apache.org<mailto:user@phoenix.apache.org>"
Date: Wednesday, September 23, 2015 at 1:14 PM
To: "user@phoenix.apache.org<mailto:user@phoenix.apache.org>"
Subject: Re: Spark Plugin Exception - java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to 
org.apache.spark.sql.Row

I have filed PHOENIX-2287<https://issues.apache.org/jira/browse/PHOENIX-2287> 
for this. And the code works fine with Spark 1.4.1.

Thanks

On Wed, Sep 23, 2015 at 6:06 AM Josh Mahonin 
<jmaho...@interset.com> wrote:
Hi Babar,

Can you file a JIRA for this? I suspect this is something to do with the Spark 
1.5 data frame API data structures, perhaps they've gone and changed them again!

Can you try with previous Spark versions to see if there's a difference? Also, 
you may have luck interfacing with the RDDs directly instead of the data frames.

Thanks!

Josh

From: Babar Tareen
Reply-To: "user@phoenix.apache.org<mailto:user@phoenix.apache.org>"
Date: Tuesday, September 22, 2015 at 5:47 PM
To: "user@phoenix.apache.org<mailto:user@phoenix.apache.org>"
Subject: Spark Plugin Exception - java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to 
org.apache.spark.sql.Row

Hi,

I am trying to run the spark plugin DataFrame sample code available here 
(https://phoenix.apache.org/phoenix_spark.html) and getting following 
exception.  I am running the code against hbase-1.1.1, spark 1.5.0 and phoenix  
4.5.2. HBase is running in standalone mode, locally on OS X.  Any ideas what 
might be causing this exception?


java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to 
org.apache.spark.sql.Row
at org.apache.spark.sql.SQLContext$$anonfun$7.apply(SQLContext.scala:439) 
~[spark-sql_2.11-1.5.0.jar:1.5.0]
at scala.collection.Iterator$$anon$11.next(Iterator.scala:363) 
~[scala-library-2.11.4.jar:na]
at scala.collection.Iterator$$anon$11.next(Iterator.scala:363) 
~[scala-library-2.11.4.jar:na]
at scala.collection.Iterator$$anon$11.next(Iterator.scala:363) 
~[scala-library-2.11.4.jar:na]
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:366)
 ~[spark-sql_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
 ~[spark-sql_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
 ~[spark-sql_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
 ~[spark-sql_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
 ~[spark-sql_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
 ~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.scheduler.Task.run(Task.scala:88) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) 
~[spark-core_2.11-1.5.0.jar:1.5.0]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
[na:1.8.0_45]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
[na:1.8.0_45]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]

Thanks,
Babar


Re: setting up community repo of Phoenix for CDH5?

2015-09-14 Thread Josh Mahonin
On Mon, Sep 14, 2015 at 9:21 AM, James Heather 
wrote:

> I'm not certain of the best way to manage this. Perhaps we need a new
> mailing list for those who want to help, to avoid cluttering this list up.


Just my opinion, but maybe a tag in the email subject, something like [CDH]
as a prefix? I don't know if list clutter is an issue to worry about yet. :)

Good luck!

Josh


Re: JOIN issue, getting errors

2015-09-09 Thread Josh Mahonin
This looks suspiciously like
https://issues.apache.org/jira/browse/PHOENIX-2169

Are you able to perform the same queries when the tables aren't salted?
Also, what versions of HBase / Phoenix are you using?

On Tue, Sep 8, 2015 at 12:33 PM, M. Aaron Bossert 
wrote:

> All,
>
> Any help would be greatly appreciated...
>
> I have two tables with the following structure:
>
> CREATE TABLE NG.BARS_Cnc_Details_Hist (
>
> ip  varchar(30) not null,
>
> last_active date not null,
>
> cnc_type varchar(5),
>
> cnc_value varchar(50),
>
> pull_date date CONSTRAINT cnc_pk PRIMARY KEY(ip,last_active,cnc_value))
> SALT_BUCKETS=48;
>
>
> create table NG.IID_IP_Threat_Hist (
>
> IPaddr varchar(50) not null,
>
> Original_Date varchar(16),
>
> Classname varchar(50),
>
> Category varchar(30),
>
> Threat_Date date,
>
> Time varchar(8),
>
> Created date,
>
> VENDOR_ID integer
>
> CONSTRAINT pk PRIMARY KEY(IPaddr)) SALT_BUCKETS=48;
>
>
> When I run a query against each individual table, I get the expected
> results...but when I run a query that is a JOIN of the two (query below), I
> get errors related to IO/corruption:
>
>
> SELECT
>
>   NG.BARS_CNC_DETAILS_HIST.IP,
>
>   NG.BARS_CNC_DETAILS_HIST.LAST_ACTIVE,
>
>   NG.BARS_CNC_DETAILS_HIST.CNC_TYPE,
>
>   NG.BARS_CNC_DETAILS_HIST.CNC_VALUE,
>
>   NG.BARS_CNC_DETAILS_HIST.PULL_DATE,
>
>   NG.IID_IP_THREAT_HIST.IPADDR,
>
>   NG.IID_IP_THREAT_HIST.CLASSNAME,
>
>   NG.IID_IP_THREAT_HIST.CATEGORY,
>
>   NG.IID_IP_THREAT_HIST.THREAT_DATE,
>
>   NG.IID_IP_THREAT_HIST.TIME
>
> FROM
>
>   NG.BARS_CNC_DETAILS_HIST
>
> INNER JOIN
>
>   NG.IID_IP_THREAT_HIST
>
> ON
>
>   NG.BARS_CNC_DETAILS_HIST.IP=NG.IID_IP_THREAT_HIST.IPADDR;
>
>
> 15/09/08 11:17:56 WARN ipc.CoprocessorRpcChannel: Call failed on
> IOException
>
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> attempts=35, exceptions:
>
> Tue Sep 08 11:08:36 CDT 2015,
> org.apache.hadoop.hbase.client.RpcRetryingCaller@768f318b,
> java.io.IOException: java.io.IOException: java.lang.NoClassDefFoundError:
> org/iq80/snappy/CorruptionException
>
> at
> org.apache.phoenix.coprocessor.ServerCachingEndpointImpl.addServerCache(ServerCachingEndpointImpl.java:78)
>
> at
> org.apache.phoenix.coprocessor.generated.ServerCachingProtos$ServerCachingService.callMethod(ServerCachingProtos.java:3200)
>
> at
> org.apache.hadoop.hbase.regionserver.HRegion.execService(HRegion.java:6864)
>
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.execServiceOnRegion(HRegionServer.java:3415)
>
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.execService(HRegionServer.java:3397)
>
> at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29998)
>
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2078)
>
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>
> at
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
>
> at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
>
> at java.lang.Thread.run(Thread.java:745)
>
> Caused by: java.lang.NoClassDefFoundError:
> org/iq80/snappy/CorruptionException
>
> at java.lang.Class.forName0(Native Method)
>
> at java.lang.Class.forName(Class.java:190)
>
> at
> org.apache.phoenix.coprocessor.ServerCachingEndpointImpl.addServerCache(ServerCachingEndpointImpl.java:72)
>
> ... 10 more
>
> Caused by: java.lang.ClassNotFoundException:
> org.iq80.snappy.CorruptionException
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>
> ... 13 more
>


Re: Get values that caused the exception

2015-09-03 Thread Josh Mahonin
Hi Yiannis,

I've found the best solution to this is generally just to add logging
around that area. For example, you could add a try (or Scala Try<>) and
check if an exception has been thrown, then log it somewhere.
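
A rough sketch of that idea, assuming a SparkContext named sc and a pair RDD
of (id, rawValue) strings feeding a DOUBLE column (all names and values here
are made up):

import scala.util.Try

val input = sc.parallelize(Seq(("a", "1.5"), ("b", "12,3"), ("c", "2.0")))

// surface the rows that will not convert cleanly before calling saveToPhoenix
val bad = input.filter { case (_, raw) => Try(raw.trim.toDouble).isFailure }
bad.collect().foreach { case (id, raw) =>
  println(s"Unparseable double for id=$id: '$raw'")
}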

As a wild guess, if you're dealing with a Double datatype and getting
NumberFormatException, is it possible one of your values is a NaN?

Josh

On Thu, Sep 3, 2015 at 6:11 AM, Yiannis Gkoufas 
wrote:

> Hi there,
>
> I am using phoenix-spark to insert multiple entries on a phoenix table.
> I get the following errors:
>
> ..Exception while committing to database..
> ..Caused by: java.lang.NumberFormatException..
>
> I couldn't find on the logs what was the row that was causing the issue.
> Is it possible to extract somehow the (wrong) values that I am trying to
> insert in this column of Double type?
>
> Thanks a lot!
>


Re: Getting ArrayIndexOutOfBoundsException in UPSERT SELECT

2015-08-26 Thread Josh Mahonin
Hi Jaime,

I've run into similar issues with Phoenix 4.2.x. They seem to have gone
away for me as of 4.3.1. Are you able to upgrade to that version or higher
and try those same queries?

Josh

On Wed, Aug 26, 2015 at 3:30 PM, Jaime Solano jdjsol...@gmail.com wrote:

 Hi Yiannis,

 Not quite, but it's similar. For instance, in my case the tables are not
 salted.

 I forgot to add, I'm using Phoenix 4.2.0.

 Thanks for your reply!
 -Jaime

 On Wed, Aug 26, 2015 at 3:20 PM, Yiannis Gkoufas johngou...@gmail.com
 wrote:

 Hi Jaime,

 is this https://issues.apache.org/jira/browse/PHOENIX-2169 the error you
 are getting?

 Thanks

 On 26 August 2015 at 19:52, Jaime Solano jdjsol...@gmail.com wrote:

 Hi guys,

 I'm getting *Error: java.lang.ArrayIndexOutOfBoundsException
 (state=08000, code=101)*, while doing a query like the following:

 UPSERT INTO A (PK, COL2)
 SELECT A.PK, A.COL1 * B.COL1
 FROM A LEFT JOIN B ON A.COL3=B.COL2;

 Some additional info:
 - Table A has around 140k records, while table B only has around 100.
 - In reality, Table A has around 140 columns, but they're not part of
 the UPSERT in any way.
 - A.COL2 is DECIMAL.
 - A.COL1 is FLOAT
 - B.COL1 is DECIMAL.
 *- Running only the SELECT statement, successfully prints the results on
 the screen.*

 To be honest, I have no idea what to look for or where. Any ideas?

 Thanks,
 -Jaime






Re: ERROR 201 (22000): Illegal data on Upsert Select

2015-08-20 Thread Josh Mahonin
Tracking in https://issues.apache.org/jira/browse/PHOENIX-2169

On Thu, Aug 20, 2015 at 5:14 PM, Samarth Jain sama...@apache.org wrote:

 Yiannis,

 Can you please provide a reproducible test case (schema, minimum data to
 reproduce the error) along with the phoenix and hbase versions so we can
 take a look at it further.

 Thanks,
 Samarth

 On Thu, Aug 20, 2015 at 2:09 PM, Yiannis Gkoufas johngou...@gmail.com
 wrote:

 Hi there,

 I am getting an error while executing:

 UPSERT INTO READINGS
 SELECT R.SMID, R.DT, R.US, R.GEN, R.USEST, R.GENEST, RM.LAT, RM.LON,
 RM.ZIP, RM.FEEDER
 FROM READINGS AS R
 JOIN
 (SELECT SMID,LAT,LON,ZIP,FEEDER
  FROM READINGS_META) AS RM
 ON R.SMID = RM.SMID

 the full stacktrace is:

 Error: ERROR 201 (22000): Illegal data. ERROR 201 (22000): Illegal data.
 Expected length of at least 70 bytes, but had 25 (state=22000,code=201)
 java.sql.SQLException: ERROR 201 (22000): Illegal data. ERROR 201
 (22000): Illegal data. Expected length of at least 70 bytes, but had 25
 at
 org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:388)
 at
 org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:145)
 at
 org.apache.phoenix.util.ServerUtil.parseRemoteException(ServerUtil.java:131)
 at
 org.apache.phoenix.util.ServerUtil.parseServerExceptionOrNull(ServerUtil.java:115)
 at
 org.apache.phoenix.util.ServerUtil.parseServerException(ServerUtil.java:104)
 at
 org.apache.phoenix.iterate.BaseResultIterators.getIterators(BaseResultIterators.java:553)
 at
 org.apache.phoenix.iterate.RoundRobinResultIterator.getIterators(RoundRobinResultIterator.java:176)
 at
 org.apache.phoenix.iterate.RoundRobinResultIterator.next(RoundRobinResultIterator.java:91)
 at
 org.apache.phoenix.iterate.DelegateResultIterator.next(DelegateResultIterator.java:44)
 at
 org.apache.phoenix.compile.UpsertCompiler$2.execute(UpsertCompiler.java:685)
 at
 org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:314)
 at
 org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:306)
 at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
 at
 org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(PhoenixStatement.java:304)
 at
 org.apache.phoenix.jdbc.PhoenixStatement.execute(PhoenixStatement.java:1374)
 at sqlline.Commands.execute(Commands.java:822)
 at sqlline.Commands.sql(Commands.java:732)
 at sqlline.SqlLine.dispatch(SqlLine.java:808)
 at sqlline.SqlLine.begin(SqlLine.java:681)
 at sqlline.SqlLine.start(SqlLine.java:398)
 at sqlline.SqlLine.main(SqlLine.java:292)

 I was wondering whether it's something wrong with my data or a bug in
 Phoenix.
 How can I debug this error and identify which row is problematic?
 By the way, the SELECT statement on its own works fine.

 Thanks a lot!





Re: REG: Using Sequences in Phoenix Data Frame

2015-08-17 Thread Josh Mahonin
Oh, neat! I was looking for some references to it in code, unit tests and
docs and didn't see anything relevant.

It's possible they might just work then, although it's definitely an
untested scenario.

On Mon, Aug 17, 2015 at 10:48 AM, James Taylor jamestay...@apache.org
wrote:

 Sequences are supported by MR integration, but I'm not sure if their
 usage by the Spark integration would cause any issues.


 On Monday, August 17, 2015, Josh Mahonin jmaho...@interset.com wrote:

 Hi Satya,

 I don't believe sequences are supported by the broader Phoenix map-reduce
 integration, which the phoenix-spark module uses under the hood.

 One workaround that would give you sequential IDs, is to use the
 'zipWithIndex' method on the underlying Spark RDD, with a small 'map()'
 operation to unpack / reorganize the tuple, before saving it to Phoenix.
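
A rough sketch of that workaround (assuming an RDD[String] named 'rows' and a
placeholder table MY_TABLE(ID BIGINT, DATA VARCHAR); the starting ID here is just a
constant you'd reserve up front, e.g. via NEXT VALUE FOR over JDBC, since the IDs
won't come from the Phoenix sequence itself):

import org.apache.phoenix.spark._

val startId = 1000L
val withIds = rows.zipWithIndex.map { case (line, idx) => (startId + idx, line) }
withIds.saveToPhoenix("MY_TABLE", Seq("ID", "DATA"), zkUrl = Some("localhost:2181"))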

 Good luck!

 Josh

 On Sat, Aug 15, 2015 at 10:02 AM, Ns G nsgns...@gmail.com wrote:

 Hi All,

 I hope that someone will reply to this email as all my previous emails
 have been unanswered.

 I have 10-20 million records in a file and I want to insert them through
 Phoenix-Spark.
 The table's primary id is generated by a sequence. So, every time an
 upsert is done, the sequence id gets generated.

 Now I want to implement this in Spark, more precisely using data
 frames. Since RDDs are immutable, how can I add a sequence to the rows in
 a dataframe?

 Thanks for any help or direction or suggestion.

 Satya





Re: Exception trying to write an ARRAY of UNSIGNED_SMALLINT

2015-08-04 Thread Josh Mahonin
Hi Riccardo,

I think you've run into a bit of a mismatch between Scala and Java types.
Could you please file a JIRA ticket for this with all the info above?

You should be able to work around this by first converting your array
contents to be java.lang.Short. I just tried this out and it worked for me:

DDL:
CREATE TABLE ARRAY_TEST_TABLE_SHORT (ID BIGINT NOT NULL PRIMARY KEY,
SHORTARRAY SMALLINT[]);

Spark:

val dataSet = List((1L, Array[java.lang.Short](1.toShort, 2.toShort,
3.toShort)))
sc.parallelize(dataSet).saveToPhoenix("ARRAY_TEST_TABLE_SHORT",
Seq("ID", "SHORTARRAY"), zkUrl = Some("localhost"))


Best of luck,

Josh



On Tue, Aug 4, 2015 at 6:49 AM, Riccardo Cardin riccardo.car...@gmail.com
wrote:

 Hi all,

 I am using Phoenix version 4.5.0 and the phoenix-spark plugin to write
 into HBase an ARRAY of UNSIGNED_SMALLINT. As stated in the documentation,
 this type is mapped to the java type java.lang.Short.

 Using the saveToPhoenix method on an RDD and passing a Scala Array of Short
 I obtain the following stacktrace:

 Caused by: java.lang.ClassCastException: *[S cannot be cast to
 [Ljava.lang.Object;*
 at
 org.apache.phoenix.schema.types.PUnsignedSmallintArray.isCoercibleTo(PUnsignedSmallintArray.java:81)
 at
 org.apache.phoenix.expression.LiteralExpression.newConstant(LiteralExpression.java:174)
 at
 org.apache.phoenix.expression.LiteralExpression.newConstant(LiteralExpression.java:157)
 at
 org.apache.phoenix.expression.LiteralExpression.newConstant(LiteralExpression.java:144)
 at
 org.apache.phoenix.compile.UpsertCompiler$UpsertValuesCompiler.visit(UpsertCompiler.java:872)
 at
 org.apache.phoenix.compile.UpsertCompiler$UpsertValuesCompiler.visit(UpsertCompiler.java:856)
 at org.apache.phoenix.parse.BindParseNode.accept(BindParseNode.java:47)
 at
 org.apache.phoenix.compile.UpsertCompiler.compile(UpsertCompiler.java:745)
 at
 org.apache.phoenix.jdbc.PhoenixStatement$ExecutableUpsertStatement.compilePlan(PhoenixStatement.java:550)
 at
 org.apache.phoenix.jdbc.PhoenixStatement$ExecutableUpsertStatement.compilePlan(PhoenixStatement.java:538)
 at
 org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:318)
 at
 org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:311)
 at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
 at
 org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(PhoenixStatement.java:309)
 at
 org.apache.phoenix.jdbc.PhoenixStatement.execute(PhoenixStatement.java:239)
 at
 org.apache.phoenix.jdbc.PhoenixPreparedStatement.execute(PhoenixPreparedStatement.java:173)
 at
 org.apache.phoenix.jdbc.PhoenixStatement.executeBatch(PhoenixStatement.java:1315)

 Changing the type of the column to CHAR(1) ARRAY and using an Array of
 String, the write operation succeeds.

 What am I doing wrong?

 Thanks a lot,
 Riccardo
 --



Re: Using phoenix-spark plugin to insert an ARRAY Type

2015-07-30 Thread Josh Mahonin
Hi Riccardo,

For saving arrays, you can use the plain old scala Array type. You can see
the tests for an example:
https://github.com/apache/phoenix/blob/master/phoenix-spark/src/it/scala/org/apache/phoenix/spark/PhoenixSparkIT.scala#L408-L427

Note that saving arrays is only supported in Phoenix 4.5.0, although the
patch is quite small if you need to apply it yourself:
https://issues.apache.org/jira/browse/PHOENIX-1968
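
The usage ends up looking roughly like this (sketch only, assuming a table such as
ARRAY_TEST_TABLE(ID BIGINT NOT NULL PRIMARY KEY, VCARRAY VARCHAR[])):

import org.apache.phoenix.spark._

val dataSet = List((1L, Array("String1", "String2", "String3")))
sc.parallelize(dataSet)
  .saveToPhoenix("ARRAY_TEST_TABLE", Seq("ID", "VCARRAY"), zkUrl = Some("localhost:2181"))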

Good luck!

Josh

On Thu, Jul 30, 2015 at 5:35 AM, Riccardo Cardin riccardo.car...@gmail.com
wrote:

 Hi all,

 I have a problem with the phoenix-spark plugin. I have a Spark RDD that I
 have to store inside an HBase table. We use the Apache-phoenix layer to
 dialog with the database. There is a column of the table that is defined as
 an UNSIGNED_SMALLINT ARRAY:

 CREATE TABLE EXAMPLE (..., Col10 UNSIGNED_SMALLINT ARRAY, ...);

 As stated in the Phoenix documentation, the ARRAY data type is backed by
 the java.sql.Array.

 I'm using the *phoenix-spark* plugin to save data of the RDD inside the
 table. The problem is that I don't know how to create an instance of
 java.sql.Array, not having any kind of Connection object. An extract of
 the code follows (code is in Scala language):

 // Map RDD into an RDD of sequences or tuples
 rdd.map {
   value =>
 (/* ... */
  value.getArray(),   // Array of Int to convert into a java.sql.Array
  /* ... */
 )}.saveToPhoenix("EXAMPLE", Seq(/* ... */, "Col10", /* ... */), conf,
 zkUrl)

 What is the correct way to go on? Is there a way to do what I need?

 Best regards,

 Riccardo



Re: HBase's checkAndPut, Timestamp in Phoenix-Spark API

2015-07-18 Thread Josh Mahonin
Hi,

The phoenix-spark integration is a thin wrapper around the
phoenix-mapreduce integration, which under the hood just uses Phoenix's
'UPSERT' functionality for saving. As far as I know, there are no provisions
for checkAndPut functionality there, so if you require it, I suggest
sticking to the HBase API for now.

Re: timestamp, the phoenix-spark API only supports using the native Phoenix
time formats, i.e. DATE, TIMESTAMP.
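
For instance, a java.sql.Timestamp value can be written into a TIMESTAMP column
(untested sketch; the table EVENTS(ID BIGINT PRIMARY KEY, CREATED_DATE TIMESTAMP) is
made up). Note this only populates a regular column, it does not set the underlying
HBase cell timestamp:

import java.sql.Timestamp
import org.apache.phoenix.spark._

val now = new Timestamp(System.currentTimeMillis())
sc.parallelize(Seq((1L, now)))
  .saveToPhoenix("EVENTS", Seq("ID", "CREATED_DATE"), zkUrl = Some("localhost:2181"))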

Good luck,

Josh

On Fri, Jul 17, 2015 at 11:40 PM, kaye ann kayeann_po...@yahoo.com wrote:

 I am using Spark 1.3, HBase 1.1 and Phoenix 4.4. I have this in my code:

 val rdd = processedRdd.map(r => Row.fromSeq(r))
 val dataframe = sqlContext.createDataFrame(rdd, schema)
 dataframe.save("org.apache.phoenix.spark", SaveMode.Overwrite,
 Map("table" -> HTABLE, "zkUrl" -> zkQuorum))

 This code works, but...
 1. How do I implement HBase's checkAndPut using Phoenix-Spark API?
 CREATED_DATE is always set to DateTime.now() in the dataframe.
 I don't want the field to be updated if the row already exists in HBase,
 even when other fields are updated.
 I can achieve it using HBase's checkAndPut: Put all the fields and use
 checkAndPut on created_date field.
 2. How do I add an HBase Timestamp using Phoenix-Spark similiar to HBase
 API:
 Put(rowkey, timestamp.getMillis)
 -
 This is my code using HBase API that I am trying to convert to
 Phoenix-Spark since I think Phoenix-Spark is more optimized:

 rdd.foreachPartition(p => {
   val conf = HBaseConfiguration.create()
   val hTable = new HTable(conf, HTABLE)
   hTable.setAutoFlushTo(false)

   p.foreach(r => {
 val hTimestamp = ...
 val rowkey = ...

 val hRow = new Put(rowkey, hTimestamp.getMillis)
 r.filter(...).foreach(tuple =>
   hRow.add(toBytes(tuple._1), toBytes(tuple._2), toBytes(tuple._3))
 )
 hTable.put(hRow)

 val CREATED_DATE_PUT = new Put(rowkey, hTimestamp.getMillis)
   .add(toBytes(CF), toBytes(CREATED_DATE), toBytes(now))
 hTable.checkAndPut(rowkey, toBytes(CF), toBytes(CREATED_DATE), null, 
 CREATED_DATE_PUT)

   })
   hTable.flushCommits()
   hTable.close()
 })




Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-09 Thread Josh Mahonin
This may or may not be helpful for your classpath issues, but I wanted to
verify that basic functionality worked, so I made a sample app here:

https://github.com/jmahonin/spark-streaming-phoenix

This consumes events off a Kafka topic using spark streaming, and writes
out event counts to Phoenix using the new phoenix-spark functionality:
http://phoenix.apache.org/phoenix_spark.html

It's definitely overkill, and would probably be more efficient to use the
JDBC driver directly, but it serves as a proof-of-concept.

I've only tested this in local mode. To convert it to a full jobs JAR, I
suspect that keeping all of the spark and phoenix dependencies marked as
'provided', and including the Phoenix client JAR in the Spark classpath
would work as well.
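
One possible shape for the build, assuming SBT (hypothetical build.sbt fragment;
the versions are illustrative only, and the phoenix-client JAR still needs to be
added to the Spark classpath at launch time):

libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-core"      % "1.3.1"            % "provided",
  "org.apache.spark"   %% "spark-streaming" % "1.3.1"            % "provided",
  "org.apache.phoenix" %  "phoenix-core"    % "4.4.0-HBase-0.98" % "provided"
)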

Good luck,

Josh

On Tue, Jun 9, 2015 at 4:40 AM, Jeroen Vlek j.v...@anchormen.nl wrote:

 Hi,

 I posted a question with regards to Phoenix and Spark Streaming on
 StackOverflow [1]. Please find a copy of the question to this email below
 the
 first stack trace. I also already contacted the Phoenix mailing list and
 tried
 the suggestion of setting spark.driver.userClassPathFirst. Unfortunately
 that
 only pushed me further into the dependency hell, which I tried to resolve
 until I hit a wall with an UnsatisfiedLinkError on Snappy.

 What I am trying to achieve: To save a stream from Kafka into
 Phoenix/Hbase
 via Spark Streaming. I'm using MapR as a platform and the original
 exception
 happens both on a 3-node cluster, as on the MapR Sandbox (a VM for
 experimentation), in YARN and stand-alone mode. Further experimentation
 (like
 the saveAsNewHadoopApiFile below), was done only on the sandbox in
 standalone
 mode.

 Phoenix only supports Spark from 4.4.0 onwards, but I thought I could
 use a naive implementation that creates a new connection for
 every RDD from the DStream in 4.3.1.  This resulted in the
 ClassNotFoundException described in [1], so I switched to 4.4.0.

 Unfortunately the saveToPhoenix method is only available in Scala. So I did
 find the suggestion to try it via the saveAsNewHadoopApiFile method [2]
 and an
 example implementation [3], which I adapted to my own needs.

 However, 4.4.0 + saveAsNewHadoopApiFile raises the same
 ClassNotFoundException, just with a slightly different stacktrace:

   java.lang.RuntimeException: java.sql.SQLException: ERROR 103
 (08004): Unable to establish connection.
 at

 org.apache.phoenix.mapreduce.PhoenixOutputFormat.getRecordWriter(PhoenixOutputFormat.java:58)
 at

 org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:995)
 at

 org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.sql.SQLException: ERROR 103 (08004): Unable to
 establish connection.
 at

 org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:386)
 at

 org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:145)
 at

 org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection(ConnectionQueryServicesImpl.java:288)
 at

 org.apache.phoenix.query.ConnectionQueryServicesImpl.access$300(ConnectionQueryServicesImpl.java:171)
 at

 org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:1881)
 at

 org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:1860)
 at

 org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:77)
 at

 org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:1860)
 at

 org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:162)
 at

 org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.connect(PhoenixEmbeddedDriver.java:131)
 at
 org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:133)
 at
 java.sql.DriverManager.getConnection(DriverManager.java:571)
 at
 java.sql.DriverManager.getConnection(DriverManager.java:187)
 at

 org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(ConnectionUtil.java:92)
 at

 org.apache.phoenix.mapreduce.util.ConnectionUtil.getOutputConnection(ConnectionUtil.java:80)
 at

 org.apache.phoenix.mapreduce.util.ConnectionUtil.getOutputConnection(ConnectionUtil.java:68)
 at

 org.apache.phoenix.mapreduce.PhoenixRecordWriter.init(PhoenixRecordWriter.java:49)
 

Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-08 Thread Josh Mahonin
Hi Jeroen,

Have you tried using the phoenix-client uber JAR in the Spark classpath?
That strategy I think is the simplest and most straight-forward, although
it may not be appropriate for all projects.

With your setup though, my guess is that Spark is preferring to use its own
versions of Hadoop JARS, versus the ones you're packaging in your jobs JAR,
or on the classpath. You may need to look at the following settings as well:
'spark.driver.userClassPathFirst'
'spark.executor.userClassPathFirst'

Good luck!

Josh

On Mon, Jun 8, 2015 at 7:39 AM, Jeroen Vlek j.v...@anchormen.nl wrote:

 Hi,

 I posted a question with regards to Phoenix and Spark Streaming on
 StackOverflow [1] and realized that I might have more luck trying it on
 here. I copied the complete question to this email as well (see below)
 If you guys deem it necessary, I can also try my luck on the Spark
 mailing list.

 Phoenix only supports Spark from 4.4.0 onwards, but I thought I could
 use a naive implementation that only creates a new connection for
 every RDD from the DStream in 4.3.1.

 Then I switched to 4.4.0, but unfortunately the saveToPhoenix method
 is only available in Scala, yet I did find the suggestion to try it via the
 saveAsNewHadoopApiFile method [2] and an example
 implementation [3]. Unfortunately, that raises the same issue, just a
 different stacktrace:

   java.lang.RuntimeException: java.sql.SQLException: ERROR 103
 (08004): Unable to establish connection.
 at

 org.apache.phoenix.mapreduce.PhoenixOutputFormat.getRecordWriter(PhoenixOutputFormat.java:58)
 at

 org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:995)
 at

 org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.sql.SQLException: ERROR 103 (08004): Unable to
 establish connection.
 at

 org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:386)
 at

 org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:145)
 at

 org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection(ConnectionQueryServicesImpl.java:288)
 at

 org.apache.phoenix.query.ConnectionQueryServicesImpl.access$300(ConnectionQueryServicesImpl.java:171)
 at

 org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:1881)
 at

 org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:1860)
 at

 org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:77)
 at

 org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:1860)
 at

 org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:162)
 at

 org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.connect(PhoenixEmbeddedDriver.java:131)
 at
 org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:133)
 at
 java.sql.DriverManager.getConnection(DriverManager.java:571)
 at
 java.sql.DriverManager.getConnection(DriverManager.java:187)
 at

 org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(ConnectionUtil.java:92)
 at

 org.apache.phoenix.mapreduce.util.ConnectionUtil.getOutputConnection(ConnectionUtil.java:80)
 at

 org.apache.phoenix.mapreduce.util.ConnectionUtil.getOutputConnection(ConnectionUtil.java:68)
 at

 org.apache.phoenix.mapreduce.PhoenixRecordWriter.init(PhoenixRecordWriter.java:49)
 at

 org.apache.phoenix.mapreduce.PhoenixOutputFormat.getRecordWriter(PhoenixOutputFormat.java:55)
 ... 8 more
 Caused by: java.io.IOException:
 java.lang.reflect.InvocationTargetException
 at

 org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:457)
 at

 org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:350)
 at

 org.apache.phoenix.query.HConnectionFactory$HConnectionFactoryImpl.createConnection(HConnectionFactory.java:47)
 at

 org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection(ConnectionQueryServicesImpl.java:286)
 ... 23 more
 Caused by: java.lang.reflect.InvocationTargetException
 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)
 at

 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 

Re: Persisting Objects thru Phoenix

2015-03-23 Thread Josh Mahonin
Hi Anirudha,

We're presently using Phoenix with the Dropwizard framework, using JDBI:
https://dropwizard.github.io/dropwizard/manual/jdbi.html

As well, I did a small trial run with the Play 2 Framework using both Anorm
and Ebean, which were successful. We ended up choosing Dropwizard instead
though.
https://www.playframework.com/documentation/2.3.x/ScalaAnorm
https://www.playframework.com/documentation/2.3.x/JavaEbean

If I recall correctly, most things just work out of the box. Depending on
the framework, you'll need to have a way to force the UPSERT syntax vs.
UPDATE. If you end up going with JDBI, I believe we had to override a
'validation query' setting, as well as an 'auto commit' setting in the
Dropwizard DB configuration.
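
Whichever framework you pick, what it ultimately needs to emit is a plain Phoenix
UPSERT over JDBC, roughly like the sketch below (the table name, columns and
connection URL are placeholders):

import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")
conn.setAutoCommit(true) // or batch the upserts and call conn.commit() yourself
val stmt = conn.prepareStatement("UPSERT INTO MY_TABLE (ID, NAME) VALUES (?, ?)")
stmt.setLong(1, 1L)
stmt.setString(2, "example")
stmt.executeUpdate()
stmt.close()
conn.close()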

Best,

Josh

On Thu, Mar 19, 2015 at 10:00 AM, Anirudha Khanna akha...@marinsoftware.com
 wrote:

 Hi Thomas,

 Thanks for your response. We are looking to store our relational data in
 HBase through Phoenix. In that respect we are evaluating whether to use a
 framework like JPA or to go with native SQL, using a lib like jdbi or jooq, for
 doing the serdes to and from the data store. I was curious if anyone in the
 community had a similar use case and how they are solving this problem.

 Cheers,
 Anirudha

 On Thu, Mar 19, 2015 at 1:39 AM, Thomas D'Silva tdsi...@salesforce.com
 wrote:

 Anirudha

 At Salesforce,  one of the use cases Phoenix and HBase is used for is
 storing immutable event data such as login information. We periodically run
 aggregate queries to generate metrics eg. number of logins per user. We
 select the columns of the primary key based on the filters used while
 querying data. Our objects don't have multi-level parent child
 relationships.
 Do you have any specific information you are looking for?

 -Thomas

 On Wed, Mar 18, 2015 at 11:28 AM, Anirudha Khanna 
 akha...@marinsoftware.com wrote:

 Hi,

 We are evaluating using Phoenix over HBase for persisting relational
 data. Has anyone tried doing something similar? Any experience reports
 would be really helpful.
 Quick note, some of our objects have upto 3 - 4 levels of parent - child
 relations.

 Cheers,
 Anirudha






Re: Fwd: java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString

2015-03-18 Thread Josh Mahonin
Have you tried adding hbase-protocol to the SPARK_CLASSPATH? That worked
for me to get Spark playing nicely with Phoenix.

On Tue, Mar 17, 2015 at 6:15 PM, Andrew Purtell apurt...@apache.org wrote:

 This is HBASE-8 (https://issues.apache.org/jira/browse/HBASE-8).
 Looks like someone else wrote in that Oozie wasn't working for them. You
 should follow up on the HBase issue tracker, although no promises, it may
 be an Oozie problem, but this is not a Phoenix issue.

 On Mon, Mar 16, 2015 at 2:39 AM, junaid khalid 
 junaid.kha...@platalytics.com wrote:

 hbase-protocol.jar is added to path. I can see that in spark-UI of the
 running application.

 phoenix-core version is 4.3.0 and phoenix-server.jar version is 4.2.3.

 On Mon, Mar 16, 2015 at 2:15 PM, Fulin Sun su...@certusnet.com.cn
 wrote:

 Hi,
 Did you add hbase-protocol.jar into your application classpath?
 Do you find some version incompatibility between your client and server?

 Thanks,
 Sun.

 --
 --

 CertusNet


 *From:* junaid khalid junaid.kha...@platalytics.com
 *Date:* 2015-03-16 16:50
 *To:* user user@phoenix.apache.org
 *Subject:* Fwd: java.lang.IllegalAccessError:
 com/google/protobuf/HBaseZeroCopyByteString
 I have a Spark program which connects to HBase using Phoenix and
 upserts records into an HBase table. It runs fine when run through the spark-submit
 command and works as expected. But when I run it through Oozie, it gives the
 following exception. When submitting through Oozie, if Spark is
 run in local mode the program works fine.

 I am using spark 1.2, phoenix 4.2.3 and hbase 0.98.6-cdh5.3.1.

 ---
 15/03/16 13:13:18 WARN HTable: Error calling coprocessor service
 org.apache.phoenix.coprocessor.generated.MetaDataProtos$MetaDataService for
 row \x00\x00TABLE55068E2AED4AB9B607BBBE49
 java.util.concurrent.ExecutionException: java.lang.IllegalAccessError:
 com/google/protobuf/HBaseZeroCopyByteString
 at java.util.concurrent.FutureTask.report(FutureTask.java:122)
 at java.util.concurrent.FutureTask.get(FutureTask.java:188)
 at
 org.apache.hadoop.hbase.client.HTable.coprocessorService(HTable.java:1583)
 at
 org.apache.hadoop.hbase.client.HTable.coprocessorService(HTable.java:1540)
 at
 org.apache.phoenix.query.ConnectionQueryServicesImpl.metaDataCoprocessorExec(ConnectionQueryServicesImpl.java:1006)
 at
 org.apache.phoenix.query.ConnectionQueryServicesImpl.getTable(ConnectionQueryServicesImpl.java:1257)
 at
 org.apache.phoenix.schema.MetaDataClient.updateCache(MetaDataClient.java:348)
 at
 org.apache.phoenix.schema.MetaDataClient.updateCache(MetaDataClient.java:309)
 at
 org.apache.phoenix.schema.MetaDataClient.updateCache(MetaDataClient.java:305)
 at
 org.apache.phoenix.compile.FromCompiler$BaseColumnResolver.createTableRef(FromCompiler.java:352)
 at
 org.apache.phoenix.compile.FromCompiler$SingleTableColumnResolver.init(FromCompiler.java:237)
 at
 org.apache.phoenix.compile.FromCompiler$SingleTableColumnResolver.init(FromCompiler.java:231)
 at
 org.apache.phoenix.compile.FromCompiler.getResolverForMutation(FromCompiler.java:207)
 at
 org.apache.phoenix.compile.UpsertCompiler.compile(UpsertCompiler.java:248)
 at
 org.apache.phoenix.jdbc.PhoenixStatement$ExecutableUpsertStatement.compilePlan(PhoenixStatement.java:487)
 at
 org.apache.phoenix.jdbc.PhoenixStatement$ExecutableUpsertStatement.compilePlan(PhoenixStatement.java:478)
 at
 org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:279)
 at
 org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:272)
 at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
 at
 org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(PhoenixStatement.java:270)
 at
 org.apache.phoenix.jdbc.PhoenixStatement.executeUpdate(PhoenixStatement.java:1052)
 at
 com.origins.platform.connectors.smartSink.SmartSink$$anonfun$loadData$1$$anonfun$apply$1.apply(SmartSink.scala:194)
 at
 com.origins.platform.connectors.smartSink.SmartSink$$anonfun$loadData$1$$anonfun$apply$1.apply(SmartSink.scala:175)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at
 com.origins.platform.connectors.smartSink.SmartSink$$anonfun$loadData$1.apply(SmartSink.scala:175)
 at
 com.origins.platform.connectors.smartSink.SmartSink$$anonfun$loadData$1.apply(SmartSink.scala:169)
 at
 org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:773)
 at
 org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:773)
 at
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
 at
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at
 

Flyway DB Migrations

2015-01-14 Thread Josh Mahonin
Hi all,

As an FYI, I've got a pull request into Flyway (http://flywaydb.org/) for
Phoenix support:
https://github.com/flyway/flyway/pull/930

I don't know what everyone else is using for schema management, if anything
at all, but the preliminary support works well enough for Flyway's various
commands (migrate, clean, baseline, repair, etc.).

A lot of the functionality is using direct queries against SYSTEM.CATALOG,
so it's entirely possible I'm doing something incorrectly, but it passes a
fairly extensive suite of tests. If anyone else wants to take a quick look
at the code to double check everything looks in order, that would be great.

I've only tested against Phoenix 4.2.2, though it should be
straight-forward enough to support other versions as well.

Josh


Re: Flyway DB Migrations

2015-01-14 Thread Josh Mahonin
Hi James,

I had actually hoped to use the DatabaseMetaData originally, but I was
getting some interesting behaviour when using the 'getTables()' query when
'schemaPattern' was null. I'm not at my dev. machine to check for sure, but
I think I was getting back a list of all tables, rather than just those for
which 'TABLE_SCHEM' was null.

I wasn't quite sure if that was expected behaviour or not, and I was
knee-deep in an unknown codebase with Flyway, so I just forged ahead and
used SYSTEM.CATALOG directly. Taking a second look at it, I think it would
have worked if I had set schemaPattern to "" instead of passing in null,
although I'm not sure that's necessarily correct either. There's a bit of
an impedance mismatch between the notions of schemas in Flyway and Phoenix,
so I tried to find a reasonable balance with only a rote understanding of
each. A few more eyes on the problem would be very helpful!
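
For reference, the difference I hit boils down to something like this (sketch
against a plain Phoenix connection; the URL is a placeholder):

import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")
val md = conn.getMetaData

// schemaPattern = null: the schema is not used to narrow the search,
// so tables from every schema come back
val allTables = md.getTables(null, null, null, Array("TABLE"))

// schemaPattern = "": only tables without a schema are returned,
// which is closer to what a "default schema" should mean here
val noSchemaTables = md.getTables(null, "", null, Array("TABLE"))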

I agree that long term using the DatabaseMetaData is the right way to go.
I'll continue to work on it when I can, but I'm hoping there might be some
interest on the mailing list to continue development on / fix any glaring
mistakes in there as well.

I'll take a look at that zookeeper exception, I'm certainly doing the most
naive thing possible to get the in-memory cluster running, so I quite likely
missed some important settings.

Thanks,

Josh

On Wed, Jan 14, 2015 at 8:35 PM, James Taylor jamestay...@apache.org
wrote:

 Wow, that's really awesome, Josh. Nice work. Can you let us know
 if/when it makes it in?

 One modification you may want to consider in a future revision to
 protect yourself in case the SYSTEM.CATALOG schema changes down the
 road: Use the DatabaseMetaData APIs[1] instead of querying the
 SYSTEM.CATALOG table directly. You can access this through the
 Connection (connection.getMetaData()).

 Just a wild guess, but for the zookeeper exception you mentioned that
 can be ignored, is it maybe related to not setting the
 hbase.master.info.port to -1?

 Thanks,
 James

 [1]
 http://docs.oracle.com/javase/7/docs/api/java/sql/DatabaseMetaData.html

 On Wed, Jan 14, 2015 at 1:41 PM, Josh Mahonin jmaho...@interset.com
 wrote:
  Hi all,
 
  As an FYI, I've got a pull request into Flyway (http://flywaydb.org/)
 for
  Phoenix support:
  https://github.com/flyway/flyway/pull/930
 
  I don't know what everyone else is using for schema management, if
 anything
  at all, but the preliminary support works well enough for Flyway's
 various
  commands (migrate, clean, baseline, repair, etc.).
 
  A lot of the functionality is using direct queries against
 SYSTEM.CATALOG,
  so it's entirely possible I'm doing something incorrectly, but it passes
 a
  fairly extensive suite of tests. If anyone else wants to take a quick
 look
  at the code to double check everything looks in order, that would be
 great.
 
  I've only tested against Phoenix 4.2.2, though it should be
 straight-forward
  enough to support other versions as well.
 
  Josh



Re: Re: Fwd: Phoenix in production

2015-01-08 Thread Josh Mahonin
On Wed, Jan 7, 2015 at 1:43 PM, anil gupta anilgupt...@gmail.com wrote:

 Yup, I am aware of Spark HBase integration. Phoenix-Spark integration
 would be more sweet. :)


Hi Anil,

I'm using Spark and Phoenix in production fairly successfully. There's very
little required for integration, since Phoenix has Hadoop Input and Output
formats that Spark can use natively.

As well, there is another project which aims to bring the full Spark SQL
integration to Phoenix:
https://github.com/simplymeasured/phoenix-spark

Josh


Problem with UPSERT SELECT with CHAR field

2014-08-18 Thread Josh Mahonin
Hi all,

I'm having problems creating a join table when one of the fields involved
is a CHAR. I have a reproducible test case below:

-- Create source table
CREATE TABLE IF NOT EXISTS SOURCE_TABLE(
  TID CHAR(3) NOT NULL,
  A UNSIGNED_INT NOT NULL,
  B UNSIGNED_INT NOT NULL
  CONSTRAINT pk PRIMARY KEY (TID, A, B));

-- Populate with sample data
UPSERT INTO SOURCE_TABLE(TID, A, B) VALUES ('1', 1, 1);
UPSERT INTO SOURCE_TABLE(TID, A, B) VALUES ('1', 1, 2);
UPSERT INTO SOURCE_TABLE(TID, A, B) VALUES ('1', 1, 3);
UPSERT INTO SOURCE_TABLE(TID, A, B) VALUES ('1', 2, 1);
UPSERT INTO SOURCE_TABLE(TID, A, B) VALUES ('1', 2, 2);

-- Create table for common counts
CREATE TABLE IF NOT EXISTS JOIN_TABLE(
  TID CHAR(3) NOT NULL,
  A UNSIGNED_INT NOT NULL,
  B UNSIGNED_INT NOT NULL,
  COUNT UNSIGNED_INT
  CONSTRAINT pk PRIMARY KEY (TID, A, B));

-- Populate with common occurrences
UPSERT INTO JOIN_TABLE(TID, A, B, COUNT)
SELECT t1.TID,
   t1.A,
   t2.A,
   COUNT(*)
FROM SOURCE_TABLE t1
INNER JOIN SOURCE_TABLE t2 ON t1.B = t2.B
WHERE t1.A != t2.A
  AND t1.TID = '1'
  AND t2.TID = '1'
GROUP BY t1.TID,
 t1.A,
 t2.A;


Unfortunately that last query fails with the following:
Error: ERROR 203 (22005): Type mismatch. expected: CHAR but was:
UNSIGNED_INT at column: TID
SQLState:  22005
ErrorCode: 203

This query works if I change the data type of TID into something integer
based, like a TINYINT, but the multi-tenancy guide suggests that the tenant
column must be a CHAR or VARCHAR. I'm using Phoenix 5.0.0-SNAPSHOT built on
the latest as of August 12.

Does anyone have any ideas?

Thanks,

Josh