Re: Spark 2.x OFF_HEAP persistence

2017-01-04 Thread Vin J
Thanks for the reply, Gene. Looks like this means that, with Spark 2.x, one
has to change from rdd.persist(StorageLevel.OFF_HEAP) to
rdd.saveAsTextFile(alluxioPath) / rdd.saveAsObjectFile(alluxioPath) to get
guarantees such as a persisted RDD surviving a Spark JVM crash, as well as
the other benefits you mention.
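A minimal sketch of the two approaches side by side, assuming an existing
SparkContext and an Alluxio path such as alluxio://alluxio-master:19998/rdd-out
(hypothetical host, port and path):

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object OffHeapSketch {
  def run(sc: SparkContext): Unit = {
    val rdd = sc.parallelize(1 to 1000)

    // Spark 2.x OFF_HEAP: direct byte buffers in the executor's off-heap memory;
    // the cached data does not survive an executor JVM crash.
    rdd.persist(StorageLevel.OFF_HEAP)

    // Writing to Alluxio instead: the data lives outside the Spark JVMs, so it can
    // be re-read after a crash or shared with other applications.
    val path = "alluxio://alluxio-master:19998/rdd-out"   // hypothetical path
    rdd.saveAsObjectFile(path)
    val restored = sc.objectFile[Int](path)
    println(restored.count())
  }
}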

Vin.

On Thu, Jan 5, 2017 at 2:50 AM, Gene Pang  wrote:

> Hi Vin,
>
> From Spark 2.x, OFF_HEAP was changed to no longer directly interface with
> an external block store. The previous tight dependency was restrictive and
> reduced flexibility. It looks like the new version uses the executor's off
> heap memory to allocate direct byte buffers, and does not interface with
> any external system for the data storage. I am not aware of a way to
> connect the new version of OFF_HEAP to Alluxio.
>
> You can experience similar benefits of the old OFF_HEAP <-> Tachyon mode,
> as well as additional benefits like unified namespace or sharing in-memory
> data across applications, by using the Alluxio filesystem API.
>
> I hope this helps!
>
> Thanks,
> Gene
>
> On Wed, Jan 4, 2017 at 10:50 AM, Vin J  wrote:
>
>> Until Spark 1.6 I see there were specific properties to configure, such as
>> the external block store master url (spark.externalBlockStore.url), to use
>> the OFF_HEAP storage level, which made it clear that an external Tachyon
>> type of block store was required/used for OFF_HEAP storage.
>>
>> Can someone clarify how this has been changed in Spark 2.x - because I do
>> not see config settings anymore that point Spark to an external block store
>> like Tachyon (now Alluxio) (or am I missing it?)
>>
>> I understand there are ways to use Alluxio with Spark, but how about
>> OFF_HEAP storage - can Spark 2.x OFF_HEAP rdd persistence still exploit
>> alluxio/external block store? Any pointers to design decisions/Spark JIRAs
>> related to this will also help.
>>
>> Thanks,
>> Vin.
>>
>
>


Re: Spark GraphFrame ConnectedComponents

2017-01-04 Thread Ankur Srivastava
This is the exact trace from the driver logs

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
s3n:///8ac233e4-10f9-4eb3-aa53-df6d9d7ea7be/connected-components-c1dbc2b0/3,
expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
at
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
at
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at
org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:534)
at
org.graphframes.lib.ConnectedComponents$.org$graphframes$lib$ConnectedComponents$$run(ConnectedComponents.scala:340)
at
org.graphframes.lib.ConnectedComponents.run(ConnectedComponents.scala:139)
at GraphTest.main(GraphTest.java:31) --- Application Class
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

And I am running spark v 1.6.2 and graphframes v 0.3.0-spark1.6-s_2.10

Thanks
Ankur
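For reference, one workaround that is sometimes suggested for this error: the
failing delete goes through FileSystem.get(sc.hadoopConfiguration), which
resolves fs.defaultFS (file:/// unless configured), so pointing the default
filesystem at the S3 bucket before running the algorithm may help. A rough,
untested sketch with placeholder bucket and credentials:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("cc-on-s3-sketch"))

// Make FileSystem.get(sc.hadoopConfiguration) resolve to S3 instead of the local FS.
sc.hadoopConfiguration.set("fs.defaultFS", "s3n://my-bucket")               // placeholder bucket
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

// Checkpoint location used by the connected components implementation.
sc.setCheckpointDir("s3n://my-bucket/connected-components-checkpoints")     // placeholder path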

On Wed, Jan 4, 2017 at 8:03 PM, Ankur Srivastava  wrote:

> Hi
>
> I am rerunning the pipeline to generate the exact trace, I have below part
> of trace from last run:
>
> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
> s3n://, expected: file:///
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
> at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(
> RawLocalFileSystem.java:69)
> at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(
> RawLocalFileSystem.java:516)
> at org.apache.hadoop.fs.ChecksumFileSystem.delete(
> ChecksumFileSystem.java:528)
>
> Also I think the error is happening in this part of the code
> "ConnectedComponents.scala:339" I am referring the code @
> https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/
> graphframes/lib/ConnectedComponents.scala
>
>   if (shouldCheckpoint && (iteration % checkpointInterval == 0)) {
> // TODO: remove this after DataFrame.checkpoint is implemented
> val out = s"${checkpointDir.get}/$iteration"
> ee.write.parquet(out)
> // may hit S3 eventually consistent issue
> ee = sqlContext.read.parquet(out)
>
> // remove previous checkpoint
> if (iteration > checkpointInterval) {
>   FileSystem.get(sc.hadoopConfiguration)
>     .delete(new Path(s"${checkpointDir.get}/${iteration - checkpointInterval}"), true)
> }
>
> System.gc() // hint Spark to clean shuffle directories
>   }
>
>
> Thanks
> Ankur
>
> On Wed, Jan 4, 2017 at 5:19 PM, Felix Cheung 
> wrote:
>
>> Do you have more of the exception stack?
>>
>>
>> --
>> *From:* Ankur Srivastava 
>> *Sent:* Wednesday, January 4, 2017 4:40:02 PM
>> *To:* user@spark.apache.org
>> *Subject:* Spark GraphFrame ConnectedComponents
>>
>> Hi,
>>
>> I am trying to use the ConnectedComponent algorithm of GraphFrames but by
>> default it needs a checkpoint directory. As I am running my spark cluster
>> with S3 as the DFS and do not have access to HDFS file system I tried using
>> a s3 directory as checkpoint directory but I run into below exception:
>>
>> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
>> s3n://, expected: file:///
>>
>> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>>
>> at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalF
>> ileSystem.java:69)
>>
>> If I set checkpoint interval to -1 to avoid checkpointing the driver just
>> hangs after 3 or 4 iterations.
>>
>> Is there some way I can set the default FileSystem to S3 for Spark or any
>> other option?
>>
>> Thanks
>> Ankur
>>
>>
>


Re: L1 regularized Logistic regression ?

2017-01-04 Thread Yang
ah, found it, it's https://www.google.com/search?q=OWLQN

thanks!
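For completeness, a small sketch of L1-regularized logistic regression via the
spark.ml API, following the elasticNetParam suggestion quoted below (with an L1
component, spark.ml's LogisticRegression uses OWLQN internally); the data path
and column names are assumptions:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("l1-lr-sketch").getOrCreate()
val training = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")  // assumed path

val lr = new LogisticRegression()
  .setMaxIter(100)
  .setRegParam(0.01)        // overall regularization strength
  .setElasticNetParam(1.0)  // 1.0 = pure L1 (Lasso); 0.0 would be pure L2
  .setFeaturesCol("features")
  .setLabelCol("label")

val model = lr.fit(training)
println(s"non-zero coefficients: ${model.coefficients.toArray.count(_ != 0.0)}")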

On Wed, Jan 4, 2017 at 7:34 PM, J G  wrote:

> I haven't run this, but there is an elasticnetparam for Logistic
> Regression here: https://spark.apache.org/docs/2.0.2/ml-
> classification-regression.html#logistic-regression
>
> You'd set elasticnetparam = 1 for Lasso
>
> On Wed, Jan 4, 2017 at 7:13 PM, Yang  wrote:
>
>> does mllib support this?
>>
>> I do see Lasso impl here https://github.com/apache
>> /spark/blob/master/mllib/src/main/scala/org/apache/spark/
>> mllib/regression/Lasso.scala
>>
>> if it supports LR , could you please show me a link? what algorithm does
>> it use?
>>
>> thanks
>>
>>
>


Re: Spark GraphFrame ConnectedComponents

2017-01-04 Thread Ankur Srivastava
Hi

I am rerunning the pipeline to generate the exact trace, I have below part
of trace from last run:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
s3n://, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
at
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:516)
at
org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:528)

Also, I think the error is happening in this part of the code
(around ConnectedComponents.scala:339). I am referring to the code at
https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/lib/ConnectedComponents.scala

  if (shouldCheckpoint && (iteration % checkpointInterval == 0)) {
// TODO: remove this after DataFrame.checkpoint is implemented
val out = s"${checkpointDir.get}/$iteration"
ee.write.parquet(out)
// may hit S3 eventually consistent issue
ee = sqlContext.read.parquet(out)

// remove previous checkpoint
if (iteration > checkpointInterval) {
  FileSystem.get(sc.hadoopConfiguration)
    .delete(new Path(s"${checkpointDir.get}/${iteration - checkpointInterval}"), true)
}

System.gc() // hint Spark to clean shuffle directories
  }


Thanks
Ankur

On Wed, Jan 4, 2017 at 5:19 PM, Felix Cheung 
wrote:

> Do you have more of the exception stack?
>
>
> --
> *From:* Ankur Srivastava 
> *Sent:* Wednesday, January 4, 2017 4:40:02 PM
> *To:* user@spark.apache.org
> *Subject:* Spark GraphFrame ConnectedComponents
>
> Hi,
>
> I am trying to use the ConnectedComponent algorithm of GraphFrames but by
> default it needs a checkpoint directory. As I am running my spark cluster
> with S3 as the DFS and do not have access to HDFS file system I tried using
> a s3 directory as checkpoint directory but I run into below exception:
>
> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
> s3n://, expected: file:///
>
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>
> at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(
> RawLocalFileSystem.java:69)
>
> If I set checkpoint interval to -1 to avoid checkpointing the driver just
> hangs after 3 or 4 iterations.
>
> Is there some way I can set the default FileSystem to S3 for Spark or any
> other option?
>
> Thanks
> Ankur
>
>


Re: L1 regularized Logistic regression ?

2017-01-04 Thread J G
I haven't run this, but there is an elasticnetparam for Logistic Regression
here:
https://spark.apache.org/docs/2.0.2/ml-classification-regression.html#logistic-regression


You'd set elasticnetparam = 1 for Lasso

On Wed, Jan 4, 2017 at 7:13 PM, Yang  wrote:

> does mllib support this?
>
> I do see Lasso impl here https://github.com/apache/spark/blob/master/
> mllib/src/main/scala/org/apache/spark/mllib/regression/Lasso.scala
>
> if it supports LR , could you please show me a link? what algorithm does
> it use?
>
> thanks
>
>


Re: Spark GraphFrame ConnectedComponents

2017-01-04 Thread Felix Cheung
Do you have more of the exception stack?



From: Ankur Srivastava 
Sent: Wednesday, January 4, 2017 4:40:02 PM
To: user@spark.apache.org
Subject: Spark GraphFrame ConnectedComponents

Hi,

I am trying to use the ConnectedComponent algorithm of GraphFrames but by 
default it needs a checkpoint directory. As I am running my spark cluster with 
S3 as the DFS and do not have access to HDFS file system I tried using a s3 
directory as checkpoint directory but I run into below exception:


Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n://, expected: file:///

at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)

at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)

If I set checkpoint interval to -1 to avoid checkpointing the driver just hangs 
after 3 or 4 iterations.

Is there some way I can set the default FileSystem to S3 for Spark or any other 
option?

Thanks
Ankur



Spark GraphFrame ConnectedComponents

2017-01-04 Thread Ankur Srivastava
Hi,

I am trying to use the ConnectedComponents algorithm of GraphFrames, but by
default it needs a checkpoint directory. As I am running my Spark cluster
with S3 as the DFS and do not have access to an HDFS file system, I tried
using an S3 directory as the checkpoint directory, but I run into the below
exception:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
s3n://, expected: file:///

at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)

at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(
RawLocalFileSystem.java:69)

If I set checkpoint interval to -1 to avoid checkpointing the driver just
hangs after 3 or 4 iterations.

Is there some way I can set the default FileSystem to S3 for Spark or any
other option?

Thanks
Ankur
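For context, a rough usage sketch of the GraphFrames call in question
(graphframes 0.3.0-spark1.6); the checkpoint location below is a placeholder,
and pointing it at S3 instead is exactly what the exception above is about:

import org.apache.spark.sql.SQLContext
import org.graphframes.GraphFrame

def connectedComponentsSketch(sqlContext: SQLContext): Unit = {
  val sc = sqlContext.sparkContext
  sc.setCheckpointDir("hdfs:///tmp/cc-checkpoints")   // placeholder checkpoint location

  // Toy graph: GraphFrames expects an "id" column for vertices and "src"/"dst" for edges.
  val vertices = sqlContext.createDataFrame(Seq(Tuple1("a"), Tuple1("b"), Tuple1("c"))).toDF("id")
  val edges = sqlContext.createDataFrame(Seq(("a", "b"), ("b", "c"))).toDF("src", "dst")

  val g = GraphFrame(vertices, edges)
  val components = g.connectedComponents.run()        // adds a "component" column
  components.show()
}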


L1 regularized Logistic regression ?

2017-01-04 Thread Yang
does mllib support this?

I do see Lasso impl here
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/Lasso.scala

if it supports LR , could you please show me a link? what algorithm does it
use?

thanks


RE: Best way to process lookup ETL with Dataframes

2017-01-04 Thread Sesterhenn, Mike
Thanks a lot Nicholas.  RE: Upgrading, I was afraid someone would suggest that. 
 ☺  Yes we have an upgrade planned, but due to politics, we have to finish this 
first round of ETL before we can do the upgrade.  I can’t confirm for sure that 
this issue would be fixed in Spark >= 1.6 without doing the upgrade first, so I 
won’t be able to win the argument for upgrading yet…  You see the problem…   :(

Anyway, the good news is we just had a memory upgrade, so I should be able to 
do more persisting of the dataframes.  I am currently only persisting the join 
table (the table I am joining to, not the input data).  Although I do cache the 
input at some point before the join, it is not every time I do a split+merge.  
I’ll have to persist the input data better.

Thinking on it now, is it even necessary to cache the table I am joining to?  
Probably only if it is used more than once, right?

Thanks,
-Mike
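On the caching question, a small sketch of the pattern being discussed
(Spark 1.5-era DataFrame API; the "key"/"code" column names and the storage
level are assumptions): persist the input once before the split/merge so both
branches reuse one computed lineage, and broadcast/cache the lookup table only
if it is small or reused.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast
import org.apache.spark.storage.StorageLevel

def lookupEtl(input: DataFrame, lookup: DataFrame): DataFrame = {
  // One cache of the input feeds both the "join" and "no join" branches.
  val cachedInput = input.persist(StorageLevel.MEMORY_AND_DISK)

  // Rows that should be enriched from the lookup table (broadcast if it is small).
  val matched = cachedInput.join(broadcast(lookup), Seq("key"), "inner")

  // Rows that must NOT be joined (to avoid bad data), kept as-is with a null code.
  val unmatched = cachedInput.join(broadcast(lookup), Seq("key"), "left_outer")
    .filter(lookup("code").isNull)

  matched.unionAll(unmatched)   // unionAll in Spark 1.5; union in 2.x
}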


From: Nicholas Hakobian [mailto:nicholas.hakob...@rallyhealth.com]
Sent: Friday, December 30, 2016 5:50 PM
To: Sesterhenn, Mike
Cc: ayan guha; user@spark.apache.org
Subject: Re: Best way to process lookup ETL with Dataframes

Yep, sequential joins is what I have done in the past with similar requirements.

Splitting and merging DataFrames is most likely killing performance if you do 
not cache the DataFrame pre-split. If you do, it will compute the lineage prior 
to the cache statement once (at first invocation), then use the cached result 
to perform the additional join, then union the results. Without the cache, you 
are most likely computing the full lineage twice, all the way back to the raw 
data import and having double the read I/O.

The most optimal path will most likely depend on the size of the tables you are 
joining to. If both are small (compared to the primary data source) and can be 
broadcasted, doing the sequential join will most likely be the easiest and most 
efficient approach. If one (or both) of the tables you are joining to are 
significantly large enough that they cannot be efficiently broadcasted, going 
through the join / cache / split / second join / union path is likely to be 
faster. It also depends on how much memory you can dedicate to caching...the 
possibilities are endless.

I tend to approach this type of problem by weighing the cost of extra 
development time for a more complex join vs the extra execution time vs 
frequency of execution. For something that will execute daily (or more 
frequently) the cost of more development to have faster execution time (even if 
its only 2x faster) might be worth it.

It might also be worth investigating if a newer version of Spark (1.6 at the 
least, or 2.0 if possible) is feasible to install. There are lots of 
performance improvements in those versions, if you have the option of upgrading.

-Nick

Nicholas Szandor Hakobian, Ph.D.
Senior Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com


On Fri, Dec 30, 2016 at 3:35 PM, Sesterhenn, Mike 
> wrote:

Thanks Nicholas.  It looks like for some of my use cases, I might be able to 
use do sequential joins, and then use coalesce() (or in combination with 
withColumn(when()...)) to sort out the results.



Splitting and merging dataframes seems to really kill my app performance.  I'm 
not sure if it's a spark 1.5 thing or what, but I just refactored one column to 
do one less split/merge, and it saved me almost half the time on my job.  But 
for some use cases I don't seem to be able to avoid them.  It is important in 
some cases to NOT do a join under certain conditions for a row because bad data 
will result.



Any other thoughts?


From: Nicholas Hakobian 
>
Sent: Friday, December 30, 2016 2:12:40 PM
To: Sesterhenn, Mike
Cc: ayan guha; user@spark.apache.org

Subject: Re: Best way to process lookup ETL with Dataframes

It looks like Spark 1.5 has the coalesce function, which is like NVL, but a bit 
more flexible. From Ayan's example you should be able to use:
coalesce(b.col, c.col, 'some default')

If that doesn't have the flexibility you want, you can always use nested case
or if statements, but it's just harder to read.
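For illustration, a self-contained sketch of both forms in the Spark 1.5
DataFrame API (column names are made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{coalesce, lit, when}

object CoalesceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("coalesce-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical result of two lookup joins: either b_code or c_code may be null.
    val joined = Seq(
      ("r1", "B1", null),
      ("r2", null, "C2"),
      ("r3", null, null)
    ).toDF("id", "b_code", "c_code")

    // coalesce picks the first non-null value, with a literal default as the last resort.
    val resolved = joined.withColumn(
      "final_code",
      coalesce($"b_code", $"c_code", lit("some default")))

    // The same logic as nested conditionals, for cases where coalesce is not flexible enough.
    val verbose = joined.withColumn(
      "final_code",
      when($"b_code".isNotNull, $"b_code")
        .otherwise(when($"c_code".isNotNull, $"c_code").otherwise(lit("some default"))))

    resolved.show()
    verbose.show()
  }
}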

Nicholas Szandor Hakobian, Ph.D.
Senior Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com



On Fri, Dec 30, 2016 at 7:46 AM, Sesterhenn, Mike 
> wrote:

Thanks, but is nvl() in Spark 1.5?  I can't find it in spark.sql.functions 
(http://spark.apache.org/docs/1.5.0/api/scala/index.html#org.apache.spark.sql.functions$)



Reading about the Oracle nvl function, it seems it is similar to the na 
functions.  Not sure it will help though, because what I need is to join after 
the first join fails.



Re: Spark Aggregator for array of doubles

2017-01-04 Thread Anton Okolnychyi
Hi,

take a look at this pull request that is not merged yet:
https://github.com/apache/spark/pull/16329 . It contains examples in Java
and Scala that can be helpful.

Best regards,
Anton Okolnychyi
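In addition to the examples in that pull request, here is a rough, untested
sketch of an Aggregator that sums arrays of doubles element-wise (Spark 2.x,
Scala). The Record shape, the column names and the equal-array-length handling
are all assumptions for illustration:

import org.apache.spark.sql.{Encoder, SparkSession}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator

// One grouping key (in practice the 24 string columns, e.g. concatenated) plus the array column.
case class Record(key: String, values: Seq[Double])

// Sums the `values` arrays element-wise within each group; assumes equal lengths.
object ArraySum extends Aggregator[Record, Array[Double], Seq[Double]] {
  def zero: Array[Double] = Array.emptyDoubleArray

  def reduce(buf: Array[Double], rec: Record): Array[Double] = add(buf, rec.values.toArray)

  def merge(b1: Array[Double], b2: Array[Double]): Array[Double] = add(b1, b2)

  def finish(buf: Array[Double]): Seq[Double] = buf.toSeq

  private def add(a: Array[Double], b: Array[Double]): Array[Double] =
    if (a.isEmpty) b
    else if (b.isEmpty) a
    else Array.tabulate(a.length)(i => a(i) + b(i))

  def bufferEncoder: Encoder[Array[Double]] = ExpressionEncoder()
  def outputEncoder: Encoder[Seq[Double]] = ExpressionEncoder()
}

object ArraySumExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("array-sum-sketch").getOrCreate()
    import spark.implicits._

    // Assumed parquet layout: string columns c1..c24 and an array<double> column "values".
    val ds = spark.read.parquet("/path/to/input.parquet")
      .selectExpr("concat_ws('|', c1, c2) as key", "values")   // hypothetical column names
      .as[Record]

    ds.groupByKey(_.key).agg(ArraySum.toColumn).show(truncate = false)
  }
}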

On Jan 4, 2017 23:23, "Anil Langote"  wrote:

> Hi All,
>
> I have been working on a use case where I have a DF which has 25 columns,
> 24 columns are of type string and last column is array of doubles. For a
> given set of columns I have to apply group by and add the array of doubles,
> I have implemented UDAF which works fine but it's expensive in order to
> tune the solution I came across Aggregators which can be implemented and
> used with agg function, my question is how can we implement a aggregator
> which takes array of doubles as input and returns the array of double.
>
> I learned that it's not possible to implement the aggregator in Java can
> be done in scala only how can define the aggregator which takes array of
> doubles as input, note that I have parquet file as my input.
>
> Any pointers are highly appreciated, I read that spark UDAF is slow and
> aggregators are the way to go.
>
> Best Regards,
>
> Anil Langote
>
> +1-425-633-9747
>


Spark Aggregator for array of doubles

2017-01-04 Thread Anil Langote
Hi All,

I have been working on a use case where I have a DF which has 25 columns: 24
columns are of type string and the last column is an array of doubles. For a
given set of columns I have to apply a group by and add up the arrays of
doubles. I have implemented a UDAF which works fine, but it's expensive; in
order to tune the solution I came across Aggregators, which can be implemented
and used with the agg function. My question is: how can we implement an
aggregator which takes an array of doubles as input and returns an array of
doubles?

I learned that it's not possible to implement the aggregator in Java and it
can only be done in Scala. How can I define the aggregator which takes an
array of doubles as input? Note that I have a parquet file as my input.

Any pointers are highly appreciated. I read that Spark UDAFs are slow and
aggregators are the way to go.

Best Regards,
Anil Langote
+1-425-633-9747

IBM Fluid query versus Spark

2017-01-04 Thread Mich Talebzadeh
Hi,

Has anyone had any experience of using IBM Fluid query and comparing  it
with Spark with its MPP and in-memory capabilities?

Thanks,


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


[Spark GraphX] Graph Aggregation

2017-01-04 Thread Will Swank
Hi All - I'm new to Spark and GraphX and I'm trying to perform a
simple sum operation for a graph.  I have posted this question to
StackOverflow and also on the gitter channel to no avail.  I'm
wondering if someone can help me out.  The StackOverflow link is here:
http://stackoverflow.com/questions/41451947/spark-graphx-aggregation-summation

The problem as posted is included below (formatting is better on SO):

I'm trying to compute the sum of node values in a spark graphx graph.
In short the graph is a tree and the top node (root) should sum all
children and their children. My graph is actually a tree that looks
like this and the expected summed value should be 1850:

The tree (root at the top; each parent's value should be the sum of its children):

  VertexId 20   (root, Value: sum of 11 & 911 = 1850)
    <- VertexId 11    (Value: sum of 14 & 24 = 1550)
         <- VertexId 14   (Value: 1000)
         <- VertexId 24   (Value: 550)
    <- VertexId 911   (Value: 300)

The first stab at this looks like this:

val vertices: RDD[(VertexId, Int)] =
  sc.parallelize(Array(
    (20L, 0),
    (11L, 0),
    (14L, 1000),
    (24L, 550),
    (911L, 300)
  ))

// note that the last value in the edge is for factor (positive or negative)
val edges: RDD[Edge[Int]] =
  sc.parallelize(Array(
    Edge(14L, 11L, 1),
    Edge(24L, 11L, 1),
    Edge(11L, 20L, 1),
    Edge(911L, 20L, 1)
  ))

val dataItemGraph = Graph(vertices, edges)

val sum: VertexRDD[(Int, BigDecimal, Int)] =
  dataItemGraph.aggregateMessages[(Int, BigDecimal, Int)](
    sendMsg = { triplet => triplet.sendToDst(1, triplet.srcAttr, 1) },
    mergeMsg = { (a, b) => (a._1, a._2 * a._3 + b._2 * b._3, 1) }
  )

sum.collect.foreach(println)

This returns the following:

(20,(1,300,1))
(11,(1,1550,1))

It's doing the sum for vertex 11 but it's not rolling up to the root
node (vertex 20). What am I missing or is there a better way of doing
this? Of course the tree can be of arbitrary size and each vertex can
have an arbitrary number of children edges.

Thanks
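One way to get the totals to roll all the way up to the root is to repeat the
one-hop aggregation, rebuilding each vertex as its own base value plus the
latest totals of its children, for as many passes as the tree is deep. A rough
sketch (untested) using the same toy graph, with the depth bound as an
assumption:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId, VertexRDD}
import org.apache.spark.rdd.RDD

object TreeSumSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tree-sum-sketch").setMaster("local[*]"))

    val vertices: RDD[(VertexId, Int)] =
      sc.parallelize(Array((20L, 0), (11L, 0), (14L, 1000), (24L, 550), (911L, 300)))
    val edges: RDD[Edge[Int]] =
      sc.parallelize(Array(Edge(14L, 11L, 1), Edge(24L, 11L, 1), Edge(11L, 20L, 1), Edge(911L, 20L, 1)))

    var graph = Graph(vertices, edges)
    val depth = 3   // assumed upper bound on tree depth; one pass per level

    for (_ <- 1 to depth) {
      // Each child sends its current total to its parent (one hop per iteration).
      val fromChildren: VertexRDD[Int] =
        graph.aggregateMessages[Int](ctx => ctx.sendToDst(ctx.srcAttr), _ + _)

      // Rebuild every vertex as: its own base value + the latest totals of its children.
      val updated = vertices.leftOuterJoin(fromChildren)
        .map { case (id, (base, childSum)) => (id, base + childSum.getOrElse(0)) }

      graph = Graph(updated, edges)
    }

    // After enough passes the root (vertex 20) holds 1850 = 1000 + 550 + 300.
    graph.vertices.collect.foreach(println)
  }
}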

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 2.x OFF_HEAP persistence

2017-01-04 Thread Gene Pang
Hi Vin,

From Spark 2.x, OFF_HEAP was changed to no longer directly interface with
an external block store. The previous tight dependency was restrictive and
reduced flexibility. It looks like the new version uses the executor's off
heap memory to allocate direct byte buffers, and does not interface with
any external system for the data storage. I am not aware of a way to
connect the new version of OFF_HEAP to Alluxio.

You can experience similar benefits of the old OFF_HEAP <-> Tachyon mode, as
well as additional benefits like unified namespace or sharing in-memory data
across applications, by using the Alluxio filesystem API.

I hope this helps!

Thanks,
Gene

On Wed, Jan 4, 2017 at 10:50 AM, Vin J  wrote:

> Until Spark 1.6 I see there were specific properties to configure, such as
> the external block store master url (spark.externalBlockStore.url), to use
> the OFF_HEAP storage level, which made it clear that an external Tachyon
> type of block store was required/used for OFF_HEAP storage.
>
> Can someone clarify how this has been changed in Spark 2.x - because I do
> not see config settings anymore that point Spark to an external block store
> like Tachyon (now Alluxio) (or am I missing it?)
>
> I understand there are ways to use Alluxio with Spark, but how about
> OFF_HEAP storage - can Spark 2.x OFF_HEAP rdd persistence still exploit
> alluxio/external block store? Any pointers to design decisions/Spark JIRAs
> related to this will also help.
>
> Thanks,
> Vin.
>


Re: Approach: Incremental data load from HBASE

2017-01-04 Thread ayan guha
Hi Chetan

What do you mean by incremental load from HBase? There is a timestamp
marker for each cell, but not at Row level.
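If the per-cell timestamp is good enough as the incremental marker, a rough
sketch of a scheduled Spark job that scans only cells written since the last
run, via TableInputFormat's time-range scan options; the table name, the
checkpoint handling and the config values are assumptions:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseIncrementalSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-incremental-sketch"))

    val lastRunTs = 1483488000000L               // would be loaded from a checkpoint store
    val now = System.currentTimeMillis()

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "events")                        // assumed table name
    conf.set(TableInputFormat.SCAN_TIMERANGE_START, lastRunTs.toString)
    conf.set(TableInputFormat.SCAN_TIMERANGE_END, now.toString)

    val changed = sc.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(s"rows with cells changed since last run: ${changed.count()}")
    // `now` would then be persisted as the checkpoint for the next scheduled run.
  }
}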

On Wed, Jan 4, 2017 at 10:37 PM, Chetan Khatri 
wrote:

> Ted Yu,
>
> You understood wrong, i said Incremental load from HBase to Hive,
> individually you can say Incremental Import from HBase.
>
> On Wed, Dec 21, 2016 at 10:04 PM, Ted Yu  wrote:
>
>> Incremental load traditionally means generating hfiles and
>> using org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load
>> the data into hbase.
>>
>> For your use case, the producer needs to find rows where the flag is 0 or
>> 1.
>> After such rows are obtained, it is up to you how the result of
>> processing is delivered to hbase.
>>
>> Cheers
>>
>> On Wed, Dec 21, 2016 at 8:00 AM, Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Ok, Sure will ask.
>>>
>>> But what would be generic best practice solution for Incremental load
>>> from HBASE.
>>>
>>> On Wed, Dec 21, 2016 at 8:42 PM, Ted Yu  wrote:
>>>
 I haven't used Gobblin.
 You can consider asking Gobblin mailing list of the first option.

 The second option would work.


 On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri <
 chetan.opensou...@gmail.com> wrote:

> Hello Guys,
>
> I would like to understand different approach for Distributed
> Incremental load from HBase, Is there any *tool / incubactor tool* which
> satisfy requirement ?
>
> *Approach 1:*
>
> Write Kafka Producer and maintain manually column flag for events and
> ingest it with Linkedin Gobblin to HDFS / S3.
>
> *Approach 2:*
>
> Run Scheduled Spark Job - Read from HBase and do transformations and
> maintain flag column at HBase Level.
>
> In above both approach, I need to maintain column level flags. such as
> 0 - by default, 1-sent,2-sent and acknowledged. So next time Producer will
> take another 1000 rows of batch where flag is 0 or 1.
>
> I am looking for best practice approach with any distributed tool.
>
> Thanks.
>
> - Chetan Khatri
>


>>>
>>
>


-- 
Best Regards,
Ayan Guha


Re: Difference in R and Spark Output

2017-01-04 Thread Satya Varaprasad Allumallu
Looks like the default algorithm used by R's kmeans function is Hartigan-Wong,
whereas Spark seems to be using Lloyd's algorithm.
Can you rerun your kmeans R code using algorithm = "Lloyd" and see if the
results match?
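On the Spark side, a quick way to check the stability question raised earlier
in the thread is to fix the seed and init mode and rerun a few times; a small
sketch (the dataset path and feature column are assumptions):

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kmeans-stability-check").getOrCreate()
val data = spark.read.parquet("/path/to/features.parquet")   // assumed: has a "features" vector column

for (run <- 1 to 3) {
  val model = new KMeans()
    .setK(5)
    .setMaxIter(500)
    .setInitMode("k-means||")
    .setSeed(42L)                 // fixed seed => identical initialization on every run
    .setFeaturesCol("features")
    .setPredictionCol("prediction")
    .fit(data)

  println(s"--- run $run centers ---")
  model.clusterCenters.foreach(println)
}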


On Tue, Jan 3, 2017 at 12:18 AM, Saroj C  wrote:

> Thanks Satya.
>
>  I tried setting the initSteps as 25 and the maxIteration as 500, both in
> R and Spark. The results provided below were from that settings.
>
> Also, in Spark and R the center remains almost the same, but they are
> different from each other.
>
>
> Thanks & Regards
> Saroj
>
>
>
>
> From:Satya Varaprasad Allumallu 
> To:Saroj C 
> Cc:User 
> Date:01/02/2017 08:53 PM
> Subject:Re: Difference in R and Spark Output
> --
>
>
>
> Can you run Spark Kmeans algorithm multiple times and see if the centers
> remain stable? I am
> guessing it is related to random initialization of centers.
>
> On Mon, Jan 2, 2017 at 1:34 AM, Saroj C <*saro...@tcs.com*
> > wrote:
> Dear Felix,
>  Thanks. Please find the differences
> Cluster | Spark size | R size
> --------+------------+-------
>    0    |     69     |  114
>    1    |     79     |  141
>    2    |     77     |   93
>    3    |     90     |   44
>    4    |    130     |   53
>
>
>
> Spark - Centers (one 22-dimensional center per line, clusters 0-4):
>
> 0: 0.807554406, 0.123759, -0.58642, -0.17803, 0.624278, -0.06752, 0.033517, -0.01504, -0.02794, 0.016699, 0.20841, -0.00149, -0.05598, 0.039746, 0.030756, -0.19788, -0.07906, -0.14881, 0.0056, 0.01479, 0.066883, 0.002491
> 1: -0.428583581, -0.81975, 0.347356, -0.18664, 0.047582, 0.058692, -0.0721, -0.13873, -0.08666, 0.085334, 0.054398, -0.0228, 0.008369, 0.073103, 0.022246, -0.15439, -0.06016, -0.15073, -0.03734, 0.004299, 0.089258, -0.00694
> 2: 0.692744675, 0.148123, 0.087253, 0.851781, -0.2179, 0.003407, -0.12357, -0.01795, 0.016427, 0.088004, 0.021502, -0.04616, -0.00847, 0.023397, 0.057656, -0.12036, -0.03947, -0.13338, -0.02975, 0.012217, 0.090547, -0.00232
> 3: -0.677692276, 0.581091, 0.446125, -0.13087, 0.037225, 0.018936, 0.055286, 0.01146, -0.08648, 0.053719, 0.072753, -0.00873, -0.04448, 0.042067, 0.089221, -0.1977, -0.07368, -0.14674, -0.00641, 0.020815, 0.058425, 0.016745
> 4: 1.03518389, 0.228964, 0.539982, -0.3581, -0.13488, -0.00525, -0.1267, -0.04439, -0.01923, 0.111272, -0.05181, -0.05508, -0.04143, 0.046479, 0.059224, -0.16148, -0.07541, -0.12046, -0.03535, 0.003049, 0.070862, 0.010083
> R - Centers (one 22-dimensional center per line, clusters 0-4):
>
> 0: 0.7710882, 0.86271, 0.249609, 0.074961, 0.251188, -0.05293, -0.11106, -0.08063, 0.01516, 0.054043, 0.056937, -0.0287, -0.03291, 0.056607, 0.045214, -0.15237, -0.05442, -0.14038, -0.02326, 0.013882, 0.078523, -0.0087
> 1: -0.644077, 0.022256, 0.368266, -0.06912, 0.123979, 0.009181, -0.04506, -0.04179, -0.0255, 0.041568, 0.04081, -0.02936, -0.04849, 0.049712, 0.062894, -0.16736, -0.06679, -0.12705, -0.007, 0.018079, 0.062337, 0.00349
> 2: 0.9772678, -0.57499, 0.523792, -0.27319, 0.163677, 0.053579, -0.07616, 0.074556, 0.00662, 0.087303, 0.088835, -0.01923, -0.04938, 0.07299, 0.059872, -0.19137, -0.04737, -0.1536, 0.002926, 0.049441, 0.079147, 0.02771
> 3: 0.5172924, 0.167666, -0.16523, -0.82951, -0.77577, -0.00981, 0.018531, -0.09629, -0.1654, 0.273644, -0.05433, -0.03593, 0.115834, 0.021465, -0.00981, -0.15112, -0.16178, -0.04783, -0.19962, -0.12418, 0.07286, 0.03266
> 4: 0.717927, -0.34229, -0.33544, 0.817617, -0.21383, 0.02735, 0.01675, -0.10814, -0.1747, 0.033743, 0.038308, -0.0495, -0.05961, -0.01977, 0.092247, -0.16017, -0.04787, -0.20766, 0.040038, 0.024614, 0.090587, -0.0236
>
>
>
>
> Please let me know, if any additional info will help to find these
> anomalies.
>
> Thanks & Regards
> Saroj
>
>
>
>
> From:Felix Cheung <*felixcheun...@hotmail.com*
> >
> To:User <*user@spark.apache.org* >, Saroj
> C <*saro...@tcs.com* >
> Date:12/31/2016 10:36 AM
> Subject:Re: Difference in R and Spark Output
> --
>
>
>
>
> Could you elaborate more on the huge difference you are seeing?
>
>
> --
>
> * From:* Saroj C <*saro...@tcs.com* >
> * Sent:* Friday, December 30, 2016 5:12:04 AM
> * To:* User
> * Subject:* Difference in R and Spark Output
>
> Dear All,
> For the attached input file, there is a huge difference between the
> Clusters in R and Spark(ML). Any idea, what could be the difference ?
>
> Note we wanted to create Five(5) clusters.
>
> Please find the snippets in Spark and R
>
> Spark
>
> //Load the Data file
>
> // Create K means Cluster
>KMeans kmeans = new KMeans().setK(5).setMaxIter(500)
>.setFeaturesCol("features")
>.setPredictionCol("prediction");
>
>
> In R
>
> //Load the Data File into df
>
> //Create the K Means 

Re: Dynamic Allocation not respecting spark.executor.cores

2017-01-04 Thread Nirav Patel
If this is not an expected behavior then it should be logged as an issue.

On Tue, Jan 3, 2017 at 2:51 PM, Nirav Patel  wrote:

> When enabling dynamic scheduling I see that all executors are using only 1
> core even if I set "spark.executor.cores" to 6. If dynamic scheduling
> is disabled then each executor will have 6 cores. I have tested this
> against Spark 1.5. I wonder if this is the same behavior with 2.x as well.
>
> Thanks
>
>

-- 





Converting an InternalRow to a Row

2017-01-04 Thread Andy Dang
Hi all,
(cc-ing dev since I've hit some developer API corner)

What's the best way to convert an InternalRow to a Row if I've got an
InternalRow and the corresponding schema?

Code snippet:
@Test
public void foo() throws Exception {
    Row row = RowFactory.create(1);
    StructType struct = new StructType().add("id", DataTypes.IntegerType);
    ExpressionEncoder<Row> encoder = RowEncoder.apply(struct);
    InternalRow internalRow = encoder.toRow(row);
    System.out.println("Internal row size: " + internalRow.numFields());
    Row roundTrip = encoder.fromRow(internalRow);
    System.out.println("Round trip: " + roundTrip.size());
}

The code fails at the line encoder.fromRow() with the exception:
> Caused by: java.lang.UnsupportedOperationException: Cannot evaluate
expression: getcolumnbyordinal(0, IntegerType)
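One thing commonly suggested for this particular error is to resolve and bind
the deserializer side of the encoder before calling fromRow; a hedged Scala
sketch (ExpressionEncoder/RowEncoder are developer APIs, so details may differ
between versions):

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{IntegerType, StructType}

val schema = new StructType().add("id", IntegerType)

val encoder = RowEncoder(schema)
val internalRow = encoder.toRow(Row(1))          // Row -> InternalRow

val bound = encoder.resolveAndBind()             // bind the deserializer to the schema's attributes
val roundTrip: Row = bound.fromRow(internalRow)  // InternalRow -> Row

println(roundTrip.getInt(0))                     // 1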

---
Regards,
Andy


Spark 2.x OFF_HEAP persistence

2017-01-04 Thread Vin J
Until Spark 1.6 I see there were specific properties to configure such as
the external block store master url (spark.externalBlockStore.url) etc to
use OFF_HEAP storage level which made it clear that an external Tachyon
type of block store as required/used for OFF_HEAP storage.

Can someone clarify how this has been changed in Spark 2.x - because I do
not see config settings anymore that point Spark to an external block store
like Tachyon (now Alluxio) (or am i missing seeing it?)

I understand there are ways to use Alluxio with Spark, but how about
OFF_HEAP storage - can Spark 2.x OFF_HEAP rdd persistence still exploit
alluxio/external block store? Any pointers to design decisions/Spark JIRAs
related to this will also help.

Thanks,
Vin.


Re: (send this email to subscribe)

2017-01-04 Thread Dinko Srkoč
You can run Spark app on Dataproc, which is Google's managed Spark and
Hadoop service:

https://cloud.google.com/dataproc/docs/

basically, you:

* assemble a jar
* create a cluster
* submit a job to that cluster (with the jar)
* delete a cluster when the job is done

Before all that, one has to create a Cloud Platform project, enable
billing and Dataproc API - but all this is explained in the docs.

Cheers,
Dinko


On 4 January 2017 at 17:34, Anahita Talebi  wrote:
>
> To whom it might concern,
>
> I have a question about running a spark code on Google cloud.
>
> Actually, I have a spark code and would like to run it using multiple
> machines on Google cloud. Unfortunately, I couldn't find a good
> documentation about how to do it.
>
> Do you have any hints which could help me to solve my problem?
>
> Have a nice day,
>
> Anahita
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: top-k function for Window

2017-01-04 Thread Georg Heiler
What about
https://github.com/myui/hivemall/wiki/Efficient-Top-k-computation-on-Apache-Hive-using-Hivemall-UDTF
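Along the same lines as the aggregator-based suggestions quoted below, a rough
sketch of per-key top-k in the Dataset API that keeps a bounded buffer per
group instead of sorting each (possibly skewed) group; the Event type, ordering
column and input path are assumptions:

import java.sql.Timestamp
import scala.collection.mutable.PriorityQueue
import org.apache.spark.sql.SparkSession

case class Event(key: String, dateTime: Timestamp, payload: String)

object TopKPerKeySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("topk-sketch").getOrCreate()
    import spark.implicits._

    val events = spark.read.parquet("/path/to/events.parquet").as[Event]  // assumed layout
    val k = 10

    val topK = events
      .groupByKey(_.key)
      .flatMapGroups { (_, it) =>
        // Keep only the k most recent events per key while streaming through the group,
        // so the full group is never sorted or materialized.
        val byTime = Ordering.by[Event, Long](_.dateTime.getTime)
        val heap = PriorityQueue.empty[Event](byTime.reverse)   // min-heap on dateTime
        it.foreach { e =>
          heap.enqueue(e)
          if (heap.size > k) heap.dequeue()                     // drop the oldest of the k+1
        }
        heap.toList
      }

    topK.show()
  }
}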
Koert Kuipers  wrote on Wed, Jan 4, 2017 at 16:11:

> i assumed topk of frequencies in one pass. if its topk by known
> sorting/ordering then use priority queue aggregator instead of spacesaver.
>
> On Tue, Jan 3, 2017 at 3:11 PM, Koert Kuipers  wrote:
>
> i dont know anything about windowing or about not using developer apis...
>
> but
>
> but a trivial implementation of top-k requires a total sort per group.
> this can be done with dataset. we do this using spark-sorted (
> https://github.com/tresata/spark-sorted) but its not hard to do it
> yourself for datasets either. for rdds its actually a little harder i think
> (if you want to avoid in memory assumption, which i assume you do)..
>
> a perhaps more efficient implementation uses an aggregator. it is not hard
> to adapt algebirds topk aggregator (spacesaver) to use as a spark
> aggregator. this requires a simple adapter class. we do this in-house as
> well. although i have to say i would recommend spark 2.1.0 for this. spark
> 2.0.x aggregator codegen is too buggy in my experience.
>
> On Tue, Jan 3, 2017 at 2:09 PM, Andy Dang  wrote:
>
> Hi Austin,
>
> It's trivial to implement top-k in the RDD world - however I would like to
> stay in the Dataset API world instead of flip-flopping between the two APIs
> (consistency, wholestage codegen etc).
>
> The twitter library appears to support only RDD, and the solution you gave
> me is very similar to what I did - it doesn't work very well with skewed
> dataset :) (it has to perform the sort to work out the row number).
>
> I've been toying with the UDAF idea, but the more I write the code the
> more I see myself digging deeper into the developer API land  - not very
> ideal to be honest. Also, UDAF doesn't have any concept of sorting, so it
> gets messy really fast.
>
> ---
> Regards,
> Andy
>
> On Tue, Jan 3, 2017 at 6:59 PM, HENSLEE, AUSTIN L  wrote:
>
> Andy,
>
>
>
> You might want to also checkout the Algebird libraries from Twitter. They
> have topK and a lot of other helpful functions. I’ve used the Algebird topk
> successfully on very large data sets.
>
>
>
> You can also use Spark SQL to do a “poor man’s” topK. This depends on how
> scrupulous you are about your TopKs (I can expound on this, if needed).
>
>
>
> I obfuscated the field names, before pasting this into email – I think I
> got them all consistently.
>
>
>
> Here’s the meat of the TopK part (found on SO, but I don’t have a
> reference) – this one takes the top 4, hence “rowNum <= 4”:
>
>
>
> SELECT time_bucket,
>
>identifier1,
>
>identifier2,
>
>incomingCount
>
>   FROM (select time_bucket,
>
> identifier1,
>
> identifier2,
>
> incomingCount,
>
>ROW_NUMBER() OVER (PARTITION BY time_bucket,
>
>identifier1
>
>   ORDER BY count DESC) as rowNum
>
>   FROM tablename) tmp
>
>   WHERE rowNum <=4
>
>   ORDER BY time_bucket, identifier1, rowNum
>
>
>
> The count and order by:
>
>
>
>
>
> SELECT time_bucket,
>
>identifier1,
>
>identifier2,
>
>count(identifier2) as myCount
>
>   FROM table
>
>   GROUP BY time_bucket,
>
>identifier1,
>
>identifier2
>
>   ORDER BY time_bucket,
>
>identifier1,
>
>count(identifier2) DESC
>
>
>
>
>
> *From: *Andy Dang 
> *Date: *Tuesday, January 3, 2017 at 7:06 AM
> *To: *user 
> *Subject: *top-k function for Window
>
>
>
> Hi all,
>
>
>
> What's the best way to do top-k with Windowing in Dataset world?
>
>
>
> I have a snippet of code that filters the data to the top-k, but with
> skewed keys:
>
>
>
> val windowSpec = Window.parititionBy(skewedKeys).orderBy(dateTime)
>
> val rank = row_number().over(windowSpec)
>
>
>
> input.withColumn("rank", rank).filter("rank <= 10").drop("rank")
>
>
>
> The problem with this code is that Spark doesn't know that it can sort the
> data locally, get the local rank first. What it ends up doing is performing
> a sort by key using the skewed keys, and this blew up the cluster since the
> keys are heavily skewed.
>
>
>
> In the RDD world we can do something like:
>
> rdd.mapPartitioins(iterator -> topK(iterator))
>
> but I can't really think of an obvious to do this in the Dataset API,
> especially with Window function. I guess some UserAggregateFunction would
> do, but I wonder if there's obvious way that I missed.
>
>
>
> ---
> Regards,
> Andy
>
>
>
>
>


Re: spark sql in Cloudera package

2017-01-04 Thread Sean Owen
(You can post this on the CDH lists BTW as it's more about that
distribution.) The whole thrift server isn't supported / enabled in CDH, so
I think that's why the script isn't turned on either. I don't think it's as
much about using Impala as not wanting to do all the grunt work to make it
compatible with the (slightly different) Hive version in CDH, just to
provide roughly the same functionality, yeah.

On Wed, Jan 4, 2017 at 3:17 PM Mich Talebzadeh 
wrote:

> Sounds like Cloudera do not supply the shell for spark-sql but only
> spark-shell
>
> is that correct?
>
> I appreciate that one can use spark-shell. however, sounds like spark-sql
> is excluded in favour of Impala?
>
> cheers
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


spark sql in Cloudera package

2017-01-04 Thread Mich Talebzadeh
Sounds like Cloudera do not supply the shell for spark-sql but only
spark-shell

is that correct?

I appreciate that one can use spark-shell. however, sounds like spark-sql
is excluded in favour of Impala?

cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: top-k function for Window

2017-01-04 Thread Koert Kuipers
i assumed topk of frequencies in one pass. if its topk by known
sorting/ordering then use priority queue aggregator instead of spacesaver.

On Tue, Jan 3, 2017 at 3:11 PM, Koert Kuipers  wrote:

> i dont know anything about windowing or about not using developer apis...
>
> but
>
> but a trivial implementation of top-k requires a total sort per group.
> this can be done with dataset. we do this using spark-sorted (
> https://github.com/tresata/spark-sorted) but its not hard to do it
> yourself for datasets either. for rdds its actually a little harder i think
> (if you want to avoid in memory assumption, which i assume you do)..
>
> a perhaps more efficient implementation uses an aggregator. it is not hard
> to adapt algebirds topk aggregator (spacesaver) to use as a spark
> aggregator. this requires a simple adapter class. we do this in-house as
> well. although i have to say i would recommend spark 2.1.0 for this. spark
> 2.0.x aggregator codegen is too buggy in my experience.
>
> On Tue, Jan 3, 2017 at 2:09 PM, Andy Dang  wrote:
>
>> Hi Austin,
>>
>> It's trivial to implement top-k in the RDD world - however I would like
>> to stay in the Dataset API world instead of flip-flopping between the two
>> APIs (consistency, wholestage codegen etc).
>>
>> The twitter library appears to support only RDD, and the solution you
>> gave me is very similar to what I did - it doesn't work very well with
>> skewed dataset :) (it has to perform the sort to work out the row number).
>>
>> I've been toying with the UDAF idea, but the more I write the code the
>> more I see myself digging deeper into the developer API land  - not very
>> ideal to be honest. Also, UDAF doesn't have any concept of sorting, so it
>> gets messy really fast.
>>
>> ---
>> Regards,
>> Andy
>>
>> On Tue, Jan 3, 2017 at 6:59 PM, HENSLEE, AUSTIN L  wrote:
>>
>>> Andy,
>>>
>>>
>>>
>>> You might want to also checkout the Algebird libraries from Twitter.
>>> They have topK and a lot of other helpful functions. I’ve used the Algebird
>>> topk successfully on very large data sets.
>>>
>>>
>>>
>>> You can also use Spark SQL to do a “poor man’s” topK. This depends on
>>> how scrupulous you are about your TopKs (I can expound on this, if needed).
>>>
>>>
>>>
>>> I obfuscated the field names, before pasting this into email – I think I
>>> got them all consistently.
>>>
>>>
>>>
>>> Here’s the meat of the TopK part (found on SO, but I don’t have a
>>> reference) – this one takes the top 4, hence “rowNum <= 4”:
>>>
>>>
>>>
>>> SELECT time_bucket,
>>>
>>>identifier1,
>>>
>>>identifier2,
>>>
>>>incomingCount
>>>
>>>   FROM (select time_bucket,
>>>
>>> identifier1,
>>>
>>> identifier2,
>>>
>>> incomingCount,
>>>
>>>ROW_NUMBER() OVER (PARTITION BY time_bucket,
>>>
>>>identifier1
>>>
>>>   ORDER BY count DESC) as rowNum
>>>
>>>   FROM tablename) tmp
>>>
>>>   WHERE rowNum <=4
>>>
>>>   ORDER BY time_bucket, identifier1, rowNum
>>>
>>>
>>>
>>> The count and order by:
>>>
>>>
>>>
>>>
>>>
>>> SELECT time_bucket,
>>>
>>>identifier1,
>>>
>>>identifier2,
>>>
>>>count(identifier2) as myCount
>>>
>>>   FROM table
>>>
>>>   GROUP BY time_bucket,
>>>
>>>identifier1,
>>>
>>>identifier2
>>>
>>>   ORDER BY time_bucket,
>>>
>>>identifier1,
>>>
>>>count(identifier2) DESC
>>>
>>>
>>>
>>>
>>>
>>> *From: *Andy Dang 
>>> *Date: *Tuesday, January 3, 2017 at 7:06 AM
>>> *To: *user 
>>> *Subject: *top-k function for Window
>>>
>>>
>>>
>>> Hi all,
>>>
>>>
>>>
>>> What's the best way to do top-k with Windowing in Dataset world?
>>>
>>>
>>>
>>> I have a snippet of code that filters the data to the top-k, but with
>>> skewed keys:
>>>
>>>
>>>
>>> val windowSpec = Window.parititionBy(skewedKeys).orderBy(dateTime)
>>>
>>> val rank = row_number().over(windowSpec)
>>>
>>>
>>>
>>> input.withColumn("rank", rank).filter("rank <= 10").drop("rank")
>>>
>>>
>>>
>>> The problem with this code is that Spark doesn't know that it can sort
>>> the data locally, get the local rank first. What it ends up doing is
>>> performing a sort by key using the skewed keys, and this blew up the
>>> cluster since the keys are heavily skewed.
>>>
>>>
>>>
>>> In the RDD world we can do something like:
>>>
>>> rdd.mapPartitioins(iterator -> topK(iterator))
>>>
>>> but I can't really think of an obvious to do this in the Dataset API,
>>> especially with Window function. I guess some UserAggregateFunction would
>>> do, but I wonder if there's obvious way that I missed.
>>>
>>>
>>>
>>> ---
>>> Regards,
>>> Andy
>>>
>>
>>
>


Re: Dependency Injection and Microservice development with Spark

2017-01-04 Thread darren
We've been able to use the iPOPO dependency injection framework in our PySpark
system and deploy .egg PySpark apps that resolve and wire up all the components
(like a kernel architecture, also similar to Spring) during an initial
bootstrap sequence, then invoke those components across Spark.
Just replying for info since it's not identical to your request, but it is in
the same spirit.
Darren


Sent from my Verizon, Samsung Galaxy smartphone
 Original message From: Chetan Khatri 
 Date: 1/4/17  6:34 AM  (GMT-05:00) To: Lars 
Albertsson  Cc: user , Spark Dev List 
 Subject: Re: Dependency Injection and Microservice 
development with Spark 
Lars,
Thank you, I want to use DI for configuring all the properties (wiring) for 
below architectural approach.
Oracle -> Kafka Batch (Event Queuing) -> Spark Jobs( Incremental load from 
HBase -> Hive with Transformation) -> Spark Transformation -> PostgreSQL
Thanks.
On Thu, Dec 29, 2016 at 3:25 AM, Lars Albertsson  wrote:
Do you really need dependency injection?

DI is often used for testing purposes. Data processing jobs are easy
to test without DI, however, due to their functional and synchronous
nature. Hence, DI is often unnecessary for testing data processing
jobs, whether they are batch or streaming jobs.

Or do you want to use DI for other reasons?


Lars Albertsson
Data engineering consultant
www.mapflat.com
https://twitter.com/lalleal
+46 70 7687109
Calendar: https://goo.gl/6FBtlS, https://freebusy.io/la...@mapflat.com


On Fri, Dec 23, 2016 at 11:56 AM, Chetan Khatri  wrote:
> Hello Community,
>
> Current approach I am using for Spark Job Development with Scala + SBT and
> Uber Jar with yml properties file to pass configuration parameters. But If i
> would like to use Dependency Injection and MicroService Development like
> Spring Boot feature in Scala then what would be the standard approach.
>
> Thanks
>
> Chetan





Re: Issue with SparkR setup on RStudio

2017-01-04 Thread Md. Rezaul Karim
Cheung,

The problem has been solved after switching from Windows to Linux
environment.

Thanks.



Regards,
_
*Md. Rezaul Karim* BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html


On 2 January 2017 at 18:59, Felix Cheung  wrote:

> Perhaps it is with
>
> spark.sql.warehouse.dir="E:/Exp/"
>
> That you have in the sparkConfig parameter.
>
> Unfortunately the exception stack is fairly far away from the actual
> error, but from the top of my head spark.sql.warehouse.dir and HADOOP_HOME
> are the two different pieces that is not set in the Windows tests.
>
>
> _
> From: Md. Rezaul Karim 
> Sent: Monday, January 2, 2017 7:58 AM
> Subject: Re: Issue with SparkR setup on RStudio
> To: Felix Cheung 
> Cc: spark users 
>
>
> Hello Cheung,
>
> Happy New Year!
>
> No, I did not configure Hive on my machine. Even I have tried not setting
> the HADOOP_HOME but getting the same error.
>
>
>
> Regards,
> _
> *Md. Rezaul Karim* BSc, MSc
> PhD Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web:http://www.reza-analytics.eu/index.html
> 
>
> On 29 December 2016 at 19:16, Felix Cheung 
> wrote:
>
>> Any reason you are setting HADOOP_HOME?
>>
>> From the error it seems you are running into issue with Hive config
>> likely with trying to load hive-site.xml. Could you try not setting
>> HADOOP_HOME
>>
>>
>> --
>> *From:* Md. Rezaul Karim 
>> *Sent:* Thursday, December 29, 2016 10:24:57 AM
>> *To:* spark users
>> *Subject:* Issue with SparkR setup on RStudio
>>
>>
>> Dear Spark users,
>>
>> I am trying to set up SparkR on RStudio to perform some basic data
>> manipulations and ML modeling. However, I am getting a strange error while
>> creating a SparkR session or DataFrame that says:
>> java.lang.IllegalArgumentException: Error while instantiating
>> 'org.apache.spark.sql.hive.HiveSessionState'.
>>
>> According to Spark documentation at http://spark.apache.org/
>> docs/latest/sparkr.html#starting-up-sparksession, I don’t need to
>> configure Hive path or related variables.
>>
>> I have the following source code:
>>
>> SPARK_HOME = "C:/spark-2.1.0-bin-hadoop2.7"
>> HADOOP_HOME= "C:/spark-2.1.0-bin-hadoop2.7/bin/"
>>
>> library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R",
>> "lib")))
>> sparkR.session(appName = "SparkR-DataFrame-example", master = "local[*]",
>> sparkConfig = list(spark.sql.warehouse.dir="E:/Exp/",
>> spark.driver.memory = "8g"), enableHiveSupport = TRUE)
>>
>> # Create a simple local data.frame
>> localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
>> # Convert local data frame to a SparkDataFrame
>> df <- createDataFrame(localDF)
>> print(df)
>> head(df)
>> sparkR.session.stop()
>>
>> Please note that the HADOOP_HOME contains the ‘*winutils.exe’* file. The
>> details of the eror is as follows:
>>
>> Error in handleErrors(returnStatus, conn) :  
>> java.lang.IllegalArgumentException: Error while instantiating 
>> 'org.apache.spark.sql.hive.HiveSessionState':
>>
>>at 
>> org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
>>
>>at 
>> org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
>>
>>at 
>> org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
>>
>>at 
>> org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:67)
>>
>>at 
>> org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:66)
>>
>>at 
>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>>
>>at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>>
>>at 
>> scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>>
>>at 
>> scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>>
>>at 
>> scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>>
>>at scala.collection.Traversabl
>>
>>
>>  Any kind of help would be appreciated.
>>
>>
>>
>>
>> Regards,
>> _
>> *Md. Rezaul Karim* BSc, MSc
>> PhD Researcher, INSIGHT Centre for Data Analytics
>> National University of Ireland, Galway
>> IDA Business Park, Dangan, Galway, Ireland
>> Web:http://www.reza-analytics.eu/index.html
>> 

Initial job has not accepted any resources

2017-01-04 Thread Igor Berman
Hi All,
need your advice:
we see in some very rare cases following error in log
Initial job has not accepted any resources; check your cluster UI to ensure
that workers are registered and have sufficient resources

and in spark UI there are idle workers and application in WAITING state

in json endpoint I see

"cores" : 280,
  "coresused" : 0,
  "memory" : 2006561,
  "memoryused" : 0,
  "activeapps" : [ {
"starttime" : 1483534808858,
"id" : "app-20170104130008-0181",
"name" : "our name",
"cores" : -1,
"user" : "spark",
"memoryperslave" : 31744,
"submitdate" : "Wed Jan 04 13:00:08 UTC 2017",
"state" : "WAITING",
"duration" : 6568575
  } ],


when I kill the application and restart it - everything works fine,
ie. it's not an issue that some workers are not properly connected,
workers are there, and usually work fine

Is there some way to handle this? Maybe some timeout on this WAITING
state, so that it will exit automatically, because currently it might
be "WAITING" indefinitely...

I've thought of implementing periodic check(by calling rest api /json)
that will kill application when waiting time > 10-15 mins for some
activeapp

any advice will be appreciated,

thanks in advance

Igor


Re: Dependency Injection and Microservice development with Spark

2017-01-04 Thread Jiří Syrový
Hi,

another nice approach is to use the Reader monad instead, together with some
framework that supports this approach (e.g. Grafter -
https://github.com/zalando/grafter). It's lightweight and helps a bit with
dependency issues.
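To make the idea concrete, a minimal reader-style wiring sketch in plain Scala
(no Grafter; all names and config values are made up), passing job
configuration through functions instead of a DI container:

final case class JobConfig(kafkaBrokers: String, hbaseTable: String, pgUrl: String)

// A "Reader" is just a function JobConfig => A that we can compose before running.
final case class Reader[A](run: JobConfig => A) {
  def map[B](f: A => B): Reader[B] = Reader(cfg => f(run(cfg)))
  def flatMap[B](f: A => Reader[B]): Reader[B] = Reader(cfg => f(run(cfg)).run(cfg))
}

object WiringSketch {
  def readFromKafka: Reader[String] = Reader(cfg => s"reading from ${cfg.kafkaBrokers}")
  def writeToPostgres(data: String): Reader[String] = Reader(cfg => s"$data -> writing to ${cfg.pgUrl}")

  def pipeline: Reader[String] =
    for {
      data <- readFromKafka
      out  <- writeToPostgres(data)
    } yield out

  def main(args: Array[String]): Unit = {
    // The concrete wiring is supplied once, at the edge of the application.
    val cfg = JobConfig("broker1:9092", "events", "jdbc:postgresql://db/warehouse")  // placeholders
    println(pipeline.run(cfg))
  }
}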

2016-12-28 22:55 GMT+01:00 Lars Albertsson :

> Do you really need dependency injection?
>
> DI is often used for testing purposes. Data processing jobs are easy
> to test without DI, however, due to their functional and synchronous
> nature. Hence, DI is often unnecessary for testing data processing
> jobs, whether they are batch or streaming jobs.
>
> Or do you want to use DI for other reasons?
>
>
> Lars Albertsson
> Data engineering consultant
> www.mapflat.com
> https://twitter.com/lalleal
> +46 70 7687109
> Calendar: https://goo.gl/6FBtlS, https://freebusy.io/la...@mapflat.com
>
>
> On Fri, Dec 23, 2016 at 11:56 AM, Chetan Khatri
>  wrote:
> > Hello Community,
> >
> > Current approach I am using for Spark Job Development with Scala + SBT
> and
> > Uber Jar with yml properties file to pass configuration parameters. But
> If i
> > would like to use Dependency Injection and MicroService Development like
> > Spring Boot feature in Scala then what would be the standard approach.
> >
> > Thanks
> >
> > Chetan
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Approach: Incremental data load from HBASE

2017-01-04 Thread Chetan Khatri
Ted Yu,

You understood wrong: I said incremental load from HBase to Hive;
individually, you can call it incremental import from HBase.

On Wed, Dec 21, 2016 at 10:04 PM, Ted Yu  wrote:

> Incremental load traditionally means generating hfiles and
> using org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load the
> data into hbase.
>
> For your use case, the producer needs to find rows where the flag is 0 or
> 1.
> After such rows are obtained, it is up to you how the result of processing
> is delivered to hbase.
>
> Cheers
>
> On Wed, Dec 21, 2016 at 8:00 AM, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
>
>> Ok, Sure will ask.
>>
>> But what would be generic best practice solution for Incremental load
>> from HBASE.
>>
>> On Wed, Dec 21, 2016 at 8:42 PM, Ted Yu  wrote:
>>
>>> I haven't used Gobblin.
>>> You can consider asking Gobblin mailing list of the first option.
>>>
>>> The second option would work.
>>>
>>>
>>> On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri <
>>> chetan.opensou...@gmail.com> wrote:
>>>
 Hello Guys,

 I would like to understand different approach for Distributed
 Incremental load from HBase, Is there any *tool / incubactor tool* which
 satisfy requirement ?

 *Approach 1:*

 Write Kafka Producer and maintain manually column flag for events and
 ingest it with Linkedin Gobblin to HDFS / S3.

 *Approach 2:*

 Run Scheduled Spark Job - Read from HBase and do transformations and
 maintain flag column at HBase Level.

 In above both approach, I need to maintain column level flags. such as
 0 - by default, 1-sent,2-sent and acknowledged. So next time Producer will
 take another 1000 rows of batch where flag is 0 or 1.

 I am looking for best practice approach with any distributed tool.

 Thanks.

 - Chetan Khatri

>>>
>>>
>>
>


Re: Dependency Injection and Microservice development with Spark

2017-01-04 Thread Chetan Khatri
Lars,

Thank you, I want to use DI for configuring all the properties (wiring) for
below architectural approach.

Oracle -> Kafka Batch (Event Queuing) -> Spark Jobs( Incremental load from
HBase -> Hive with Transformation) -> Spark Transformation -> PostgreSQL

Thanks.

On Thu, Dec 29, 2016 at 3:25 AM, Lars Albertsson  wrote:

> Do you really need dependency injection?
>
> DI is often used for testing purposes. Data processing jobs are easy
> to test without DI, however, due to their functional and synchronous
> nature. Hence, DI is often unnecessary for testing data processing
> jobs, whether they are batch or streaming jobs.
>
> Or do you want to use DI for other reasons?
>
>
> Lars Albertsson
> Data engineering consultant
> www.mapflat.com
> https://twitter.com/lalleal
> +46 70 7687109
> Calendar: https://goo.gl/6FBtlS, https://freebusy.io/la...@mapflat.com
>
>
> On Fri, Dec 23, 2016 at 11:56 AM, Chetan Khatri
>  wrote:
> > Hello Community,
> >
> > Current approach I am using for Spark Job Development with Scala + SBT
> and
> > Uber Jar with yml properties file to pass configuration parameters. But
> If i
> > would like to use Dependency Injection and MicroService Development like
> > Spring Boot feature in Scala then what would be the standard approach.
> >
> > Thanks
> >
> > Chetan
>


Re: Error: PartitioningCollection requires all of its partitionings have the same numPartitions.

2017-01-04 Thread mhornbech
I am also experiencing this. Do you have a JIRA on it?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-PartitioningCollection-requires-all-of-its-partitionings-have-the-same-numPartitions-tp27875p28272.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Apache Hive with Spark Configuration

2017-01-04 Thread Chetan Khatri
Ryan,

I agree that Hive 1.2.1 works reliably with Spark 2.x, but I went with the
current stable version of Hive, which is 2.0.1, and I am working with that.
It seems good, but I want to make sure which version of Hive is more reliable
with Spark 2.x, and I think @Ryan you replied the same, which is Hive 1.2.1.

Thanks.
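For reference, a small sketch of pinning the metastore client version Spark
talks to, in line with the 1.2.1 recommendation; which version strings a given
Spark release accepts should be checked against its docs, so treat the values
below as assumptions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-metastore-sketch")
  .config("spark.sql.hive.metastore.version", "1.2.1")
  .config("spark.sql.hive.metastore.jars", "builtin")   // use the Hive 1.2.1 client shipped with Spark
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()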



On Wed, Jan 4, 2017 at 2:02 AM, Ryan Blue  wrote:

> Chetan,
>
> Spark is currently using Hive 1.2.1 to interact with the Metastore. Using
> that version for Hive is going to be the most reliable, but the metastore
> API doesn't change very often and we've found (from having different
> versions as well) that older versions are mostly compatible. Some things
> fail occasionally, but we haven't had too many problems running different
> versions with the same metastore in practice.
>
> rb
>
> On Wed, Dec 28, 2016 at 4:22 AM, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
>
>> Hello Users / Developers,
>>
>> I am using Hive 2.0.1 with MySql as a Metastore, can you tell me which
>> version is more compatible with Spark 2.0.2 ?
>>
>> THanks
>>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [External] Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-04 Thread Ben Teeuwen
Another option: https://github.com/mysql-time-machine/replicator

>From the readme:
"Replicates data changes from MySQL binlog to HBase or Kafka. In case of
HBase, preserves the previous data versions. HBase storage is intended for
auditing purposes of historical data. In addition, special daily-changes
tables can be maintained in HBase, which are convenient for fast and cheap
imports from HBase to Hive. Replication to Kafka is intended for easy
real-time access to a stream of data changes."

On Tue, Jan 3, 2017 at 10:39 PM, Yuanzhe Yang  wrote:

> Hi Ayan,
>
> This "inline view" idea is really awesome and enlightens me! Finally I
> have a plan to move on. I greatly appreciate your help!
>
> Best regards,
> Yang
>
> 2017-01-03 18:14 GMT+01:00 ayan guha :
>
>> Ahh I see what you mean... I confused two terminologies... because we
>> were talking about partitioning and then changed topic to identifying
>> changed data...
>>
>> For that, you can "construct" a dbtable as an inline view -
>>
>> viewSQL = "(select * from table where <check_column> >
>> '<last_value>')".replace("<check_column>", "inserted_on")
>>   .replace("<last_value>", checkPointedValue)
>> dbtable = viewSQL
>>
>> refer to this blog...
>>
>> So, in summary, you have 2 things
>>
>> 1. Identify changed data - my suggestion to use dbtable with inline view
>> 2. parallelism - use numPartition,lowerbound,upper bound to generate
>> number of partitions
>>
>> HTH
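A cleaned-up sketch of the two points above combined: an inline view as
`dbtable` plus partitionColumn/lowerBound/upperBound/numPartitions for parallel
reads. Connection details, column names and the checkpoint value are
placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-incremental-sketch").getOrCreate()

val lastCheckpoint = "2017-01-01 00:00:00"   // stored from the previous successful run

// 1. Only changed rows: push the filter into the database via an inline view.
val incrementalView = s"(select * from events where inserted_on > '$lastCheckpoint') as t"

// 2. Parallelism: split the read into partitions over a numeric column.
val changed = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/source")   // placeholder connection
  .option("dbtable", incrementalView)
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  .option("user", "etl")
  .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
  .load()

println(changed.count())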
>>
>>
>>
>> On Wed, Jan 4, 2017 at 3:46 AM, Yuanzhe Yang  wrote:
>>
>>> Hi Ayan,
>>>
>>> Yeah, I understand your proposal, but according to here
>>> http://spark.apache.org/docs/latest/sql-programming-gui
>>> de.html#jdbc-to-other-databases, it says
>>>
>>> Notice that lowerBound and upperBound are just used to decide the
>>> partition stride, not for filtering the rows in table. So all rows in the
>>> table will be partitioned and returned. This option applies only to reading.
>>>
>>> So my interpretation is all rows in the table are ingested, and this
>>> "lowerBound" and "upperBound" is the span of each partition. Well, I am not
>>> a native English speaker, maybe it means differently?
>>>
>>> Best regards,
>>> Yang
>>>
>>> 2017-01-03 17:23 GMT+01:00 ayan guha :
>>>
 Hi

 You need to store and capture the Max of the column you intend to use
 for identifying new records (Ex: INSERTED_ON) after every successful run of
 your job. Then, use the value in lowerBound option.

 Essentially, you want to create a query like

 select * from table where INSERTED_ON > lowerBound and
 INSERTED_ON <= upperBound
 every time you run the job



 On Wed, Jan 4, 2017 at 2:13 AM, Yuanzhe Yang  wrote:

> Hi Ayan,
>
> Thanks a lot for your suggestion. I am currently looking into sqoop.
>
> Concerning your suggestion for Spark, it is indeed parallelized with
> multiple workers, but the job is one-off and cannot keep streaming.
> Moreover, I cannot specify any "start row" in the job, it will always
> ingest the entire table. So I also cannot simulate a streaming process by
> starting the job in fix intervals...
>
> Best regards,
> Yang
>
> 2017-01-03 15:06 GMT+01:00 ayan guha :
>
>> Hi
>>
>> While the solutions provided by others looks promising and I'd like
>> to try out few of them, our old pal sqoop already "does" the job. It has 
>> a
>> incremental mode where you can provide a --check-column and
>> --last-modified-value combination to grab the data - and yes, sqoop
>> essentially does it by running a MAP-only job which spawns number of
>> parallel map task to grab data from DB.
>>
>> In Spark, you can use sqlContext.load function for JDBC and use
>> partitionColumn and numPartition to define parallelism of connection.
>>
>> Best
>> Ayan
>>
>> On Tue, Jan 3, 2017 at 10:49 PM, Yuanzhe Yang 
>> wrote:
>>
>>> Hi Ayan,
>>>
>>> Thanks a lot for such a detailed response. I really appreciate it!
>>>
>>> I think this use case can be generalized, because the data is
>>> immutable and append-only. We only need to find one column or timestamp 
>>> to
>>> track the last row consumed in the previous ingestion. This pattern 
>>> should
>>> be common when storing sensor data. If the data is mutable, then the
>>> solution will be surely difficult and vendor specific as you said.
>>>
>>> The workflow you proposed is very useful. The difficulty part is how
>>> to parallelize the ingestion task. With Spark when I have multiple 
>>> workers
>>> working on the same job, I don't know if there is a way and how to
>>> dynamically change the row range each worker should process in 
>>>