Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-10 Thread Jacek Laskowski
"Something like that" I've never tried it out myself so I'm only
guessing having a brief look at the API.
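
Roughly, a minimal sketch of that idea (untested; flushToHive below is a
hypothetical helper standing in for whatever actually appends the batch to
the Hive table, and error handling / exactly-once concerns are ignored):

import org.apache.spark.sql.{ForeachWriter, Row}
import scala.collection.mutable.ArrayBuffer

class BufferingHiveWriter extends ForeachWriter[Row] {
  private val buffer = ArrayBuffer[Row]()

  override def open(partitionId: Long, version: Long): Boolean = {
    buffer.clear()
    true  // true means "process this partition/version"
  }

  override def process(value: Row): Unit = {
    buffer += value  // accumulate rows for this partition and epoch
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (errorOrNull == null) flushToHive(buffer)  // flush once, at the end
  }

  private def flushToHive(rows: Seq[Row]): Unit = {
    // hypothetical: append `rows` to the target table in one batch
  }
}

// Hypothetical usage on a streaming Dataset `df`:
// df.writeStream.foreach(new BufferingHiveWriter).start()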

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sat, Feb 11, 2017 at 1:31 AM, Egor Pahomov  wrote:
> Jacek, so I create a cache in ForeachWriter, write to it in every "process()"
> call, and flush it on close()? Something like that?
>
> 2017-02-09 12:42 GMT-08:00 Jacek Laskowski :
>>
>> Hi,
>>
>> Yes, that's ForeachWriter.
>>
>> Yes, it works element by element. You're looking for mapPartitions;
>> ForeachWriter has a partitionId that you could use to implement a
>> similar thing.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Thu, Feb 9, 2017 at 3:55 AM, Egor Pahomov 
>> wrote:
>> > Jacek, you mean
>> >
>> > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.ForeachWriter
>> > ? I do not understand how to use it, since it passes every value
>> > separately, not every partition. And adding to the table value by value
>> > would not work.
>> >
>> > 2017-02-07 12:10 GMT-08:00 Jacek Laskowski :
>> >>
>> >> Hi,
>> >>
>> >> Have you considered foreach sink?
>> >>
>> >> Jacek
>> >>
>> >> On 6 Feb 2017 8:39 p.m., "Egor Pahomov"  wrote:
>> >>>
>> >>> Hi, I'm thinking of using Structured Streaming instead of the old
>> >>> streaming, but I need to be able to save results to a Hive table.
>> >>> The documentation for the file sink
>> >>> (http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks)
>> >>> says: "Supports writes to partitioned tables." But being able to write
>> >>> to partitioned directories is not enough to write to the table: someone
>> >>> needs to write to the Hive metastore. How can I use Structured Streaming
>> >>> and write to a Hive table?
>> >>>
>> >>> --
>> >>> Sincerely yours
>> >>> Egor Pakhomov
>> >
>> >
>> >
>> >
>> > --
>> > Sincerely yours
>> > Egor Pakhomov
>
>
>
>
> --
> Sincerely yours
> Egor Pakhomov




Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-10 Thread Egor Pahomov
Jacek, so I create a cache in ForeachWriter, write to it in every "process()"
call, and flush it on close()? Something like that?

2017-02-09 12:42 GMT-08:00 Jacek Laskowski :

> Hi,
>
> Yes, that's ForeachWriter.
>
> Yes, it works element by element. You're looking for mapPartitions;
> ForeachWriter has a partitionId that you could use to implement a
> similar thing.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Thu, Feb 9, 2017 at 3:55 AM, Egor Pahomov 
> wrote:
> > Jacek, you mean
> > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.ForeachWriter
> > ? I do not understand how to use it, since it passes every value
> > separately, not every partition. And adding to the table value by value
> > would not work.
> >
> > 2017-02-07 12:10 GMT-08:00 Jacek Laskowski :
> >>
> >> Hi,
> >>
> >> Have you considered foreach sink?
> >>
> >> Jacek
> >>
> >> On 6 Feb 2017 8:39 p.m., "Egor Pahomov"  wrote:
> >>>
> >>> Hi, I'm thinking of using Structured Streaming instead of the old
> >>> streaming, but I need to be able to save results to a Hive table.
> >>> The documentation for the file sink
> >>> (http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks)
> >>> says: "Supports writes to partitioned tables." But being able to write
> >>> to partitioned directories is not enough to write to the table: someone
> >>> needs to write to the Hive metastore. How can I use Structured Streaming
> >>> and write to a Hive table?
> >>>
> >>> --
> >>> Sincerely yours
> >>> Egor Pakhomov
> >
> >
> >
> >
> > --
> > Sincerely yours
> > Egor Pakhomov
>



-- 


*Sincerely yours*
*Egor Pakhomov*


Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-10 Thread nguyen duc Tuan
Hi Nick,
Because we use *RandomSignProjectionLSH*, the only parameter for the LSH is
the number of hashes. I tried with a small number of hashes (2) but the error
still happens, and it happens when I call the similarity join. After the
transformation, the size of the dataset is about 4G.
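
For reference, this is roughly how we call it, sketched here with the built-in
BucketedRandomProjectionLSH (our RandomSignProjectionLSH follows the same
Estimator/Model pattern; the column names and values below are examples, and
the input is assumed to have a Vector column named "features"):

import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.sql.DataFrame

def approxJoin(dfA: DataFrame, dfB: DataFrame) = {
  val lsh = new BucketedRandomProjectionLSH()
    .setNumHashTables(2)          // the "number of hashes" mentioned above
    .setBucketLength(4.0)         // assumed value, tune for your data
    .setInputCol("features")
    .setOutputCol("hashes")

  val model = lsh.fit(dfA)
  // The join is usually the memory-heavy step: each row is expanded once per
  // hash table and then joined on matching hash values.
  model.approxSimilarityJoin(dfA, dfB, 1.0)   // 1.0 = assumed distance threshold
}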

2017-02-11 3:07 GMT+07:00 Nick Pentreath :

> What other params are you using for the lsh transformer?
>
> Are the issues occurring during transform or during the similarity join?
>
>
> On Fri, 10 Feb 2017 at 05:46, nguyen duc Tuan 
> wrote:
>
>> hi Das,
>> In general, I will apply them to larger datasets, so I want to use LSH,
>> which is more scalable than the approaches you suggested. Have you
>> tried LSH in Spark 2.1.0 before? If yes, how do you set the
>> parameters/configuration to make it work?
>> Thanks.
>>
>> 2017-02-10 19:21 GMT+07:00 Debasish Das :
>>
>> If it is 7m rows and 700k features (or say 1m features) brute force row
>> similarity will run fine as well...check out spark-4823...you can compare
>> quality with approximate variant...
>> On Feb 9, 2017 2:55 AM, "nguyen duc Tuan"  wrote:
>>
>> Hi everyone,
>> Since Spark 2.1.0 introduces LSH (http://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing),
>> we want to use LSH to find approximate nearest neighbors. Basically, we have
>> a dataset with about 7M rows. We want to use cosine distance to measure the
>> similarity between items, so we use *RandomSignProjectionLSH* (
>> https://gist.github.com/tuan3w/c968e56ea8ef135096eeedb08af097db) instead
>> of *BucketedRandomProjectionLSH*. I have tried tuning some configurations such
>> as serialization, memory fraction, executor memory (~6G), number of
>> executors (~20), memory overhead, ..., but nothing works. I often get the error
>> "java.lang.OutOfMemoryError: Java heap space" while running. I know that
>> this implementation was done by an engineer at Uber, but I don't know the
>> right configuration to run the algorithm at scale. Does it need very big
>> memory to run?
>>
>> Any help would be appreciated.
>> Thanks
>>
>>
>>


Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-10 Thread Nick Pentreath
What other params are you using for the lsh transformer?

Are the issues occurring during transform or during the similarity join?


On Fri, 10 Feb 2017 at 05:46, nguyen duc Tuan  wrote:

> hi Das,
> In general, I will apply them to larger datasets, so I want to use LSH,
> which is more scalable than the approaches you suggested. Have you
> tried LSH in Spark 2.1.0 before? If yes, how do you set the
> parameters/configuration to make it work?
> Thanks.
>
> 2017-02-10 19:21 GMT+07:00 Debasish Das :
>
> If it is 7m rows and 700k features (or say 1m features) brute force row
> similarity will run fine as well...check out spark-4823...you can compare
> quality with approximate variant...
> On Feb 9, 2017 2:55 AM, "nguyen duc Tuan"  wrote:
>
> Hi everyone,
> Since Spark 2.1.0 introduces LSH (
> http://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing),
> we want to use LSH to find approximate nearest neighbors. Basically, we have
> a dataset with about 7M rows. We want to use cosine distance to measure the
> similarity between items, so we use *RandomSignProjectionLSH* (
> https://gist.github.com/tuan3w/c968e56ea8ef135096eeedb08af097db) instead
> of *BucketedRandomProjectionLSH*. I have tried tuning some configurations such
> as serialization, memory fraction, executor memory (~6G), number of
> executors (~20), memory overhead, ..., but nothing works. I often get the error
> "java.lang.OutOfMemoryError: Java heap space" while running. I know that
> this implementation was done by an engineer at Uber, but I don't know the
> right configuration to run the algorithm at scale. Does it need very big
> memory to run?
>
> Any help would be appreciated.
> Thanks
>
>
>


Re: Strange behavior with 'not' and filter pushdown

2017-02-10 Thread Everett Anderson
Bumping this thread.

Translating "where not(username is not null)" into a filter of
[IsNotNull(username),
Not(IsNotNull(username))] seems like a rather severe bug.

Spark 1.6.2:

explain select count(*) from parquet_table where not( username is not null)

== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#1822L])
+- TungstenExchange SinglePartition, None
   +- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#1825L])
      +- Project
         +- Filter NOT isnotnull(username#1590)
            +- Scan ParquetRelation[username#1590] InputPaths: ,
               PushedFilters: [Not(IsNotNull(username))]

Spark 2.0.2

explain select count(*) from parquet_table where not( username is not null)

== Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)])
      +- *Project
         +- *Filter (isnotnull(username#35) && NOT isnotnull(username#35))
            +- *BatchedScan parquet default.[username#35] Format: ParquetFormat,
               InputPaths: , PartitionFilters: [],
               PushedFilters: [IsNotNull(username), Not(IsNotNull(username))],
               ReadSchema: struct

Example to generate the above:

// Create some fake data

import org.apache.spark.sql.Row
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.types._

val rowsRDD = sc.parallelize(Seq(
Row(1, "fred"),
Row(2, "amy"),
Row(3, null)))

val schema = StructType(Seq(
StructField("id", IntegerType, nullable = true),
StructField("username", StringType, nullable = true)))

val data = sqlContext.createDataFrame(rowsRDD, schema)

val path = "SOME PATH HERE"

data.write.mode("overwrite").parquet(path)

val testData = sqlContext.read.parquet(path)

testData.registerTempTable("filter_test_table")


%sql
explain select count(*) from filter_test_table where not( username is not
null)
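
As a related check: not(username is not null) is logically equivalent to
username is null, so the rewritten form below can be used to compare what gets
pushed down (the expectation, hedged, is a single IsNull(username) filter
rather than the contradictory pair):

// Equivalent predicate, run against the same filter_test_table registered above.
sqlContext.sql(
  "explain select count(*) from filter_test_table where username is null"
).show(false)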


On Wed, Feb 8, 2017 at 4:56 PM, Alexi Kostibas 
wrote:

> Hi,
>
> I have an application where I’m filtering data with SparkSQL with simple
> WHERE clauses. I also want the ability to show the unmatched rows for any
> filter, and so am wrapping the previous clause in `NOT()` to get the
> inverse. Example:
>
> Filter:  username is not null
> Inverse filter:  NOT(username is not null)
>
> This worked fine in Spark 1.6. After upgrading to Spark 2.0.2, the inverse
> filter always returns zero results. It looks like this is a problem with
> how the filter is getting pushed down to Parquet. Specifically, the
> pushdown includes both the “is not null” filter, AND “not(is not null)”,
> which would obviously result in zero matches. An example below:
>
> pyspark:
> > x = spark.sql('select my_id from my_table where *username is not null*')
> > y = spark.sql('select my_id from my_table where not(*username is not
> null*)')
>
> > x.explain()
> == Physical Plan ==
> *Project [my_id#6L]
> +- *Filter isnotnull(username#91)
>+- *BatchedScan parquet default.my_table[my_id#6L,username#91]
>Format: ParquetFormat, InputPaths: s3://my-path/my.parquet,
>PartitionFilters: [], PushedFilters: [IsNotNull(username)],
>ReadSchema: struct
> > y.explain()
> == Physical Plan ==
> *Project [my_id#6L]
> +- *Filter (isnotnull(username#91) && NOT isnotnull(username#91))
>+- *BatchedScan parquet default.my_table[my_id#6L,username#91]
>Format: ParquetFormat, InputPaths: s3://my-path/my.parquet,
>PartitionFilters: [],
>PushedFilters: [IsNotNull(username), Not(IsNotNull(username))],
>ReadSchema: struct
>
> Presently I’m working around this by using the new functionality of NOT
> EXISTS in Spark 2, but that seems like overkill.
>
> Any help appreciated.
>
>
> *Alexi Kostibas*
> Engineering
> *Nuna*
> 650 Townsend Street, Suite 425
> San Francisco, CA 94103
>
>


Getting exit code of pipe()

2017-02-10 Thread Xuchen Yao
Hello Community,

I have the following Python code that calls an external command:

rdd.pipe('run.sh', env=os.environ).collect()

run.sh can exit with status 1 or 0. How can I get the exit code
from Python? Thanks!
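
One workaround I am considering, sketched here in Scala (the same wrapper idea
should apply to the PySpark call above; the script path and the marker string
are made up for the example): have the shell append the exit status as a marker
line, then split it back out. Note that pipe() runs the command once per
partition, so there is one marker line per partition.

// Run the script through a shell that appends its exit status as a marker line.
val piped = rdd.pipe(Seq("/bin/bash", "-c", "./run.sh; echo EXIT_STATUS=$?"))
piped.cache()

// One exit status per partition, e.g. Array(0, 0, 1, 0).
val exitStatuses = piped
  .filter(_.startsWith("EXIT_STATUS="))
  .map(_.stripPrefix("EXIT_STATUS=").toInt)
  .collect()

// The actual output lines, with the markers stripped out.
val output = piped.filter(!_.startsWith("EXIT_STATUS="))

Depending on the Spark version, PySpark's rdd.pipe also accepts a checkCode
flag that makes the job fail on a non-zero exit status, though that surfaces
the failure rather than returning the code itself.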

Xuchen


Re: [Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-10 Thread Cosmin Posteuca
Thank you very much for your answers, now I understand better what I have
to do! Thank you!

On Wed, 8 Feb 2017 at 22:37, Gourav Sengupta 
wrote:

> Hi,
>
> I am not quite sure of your use case here, but I would use spark-submit
> and submit sequential jobs as steps to an EMR cluster.
>
>
> Regards,
> Gourav
>
> On Wed, Feb 8, 2017 at 11:10 AM, Cosmin Posteuca <
> cosmin.poste...@gmail.com> wrote:
>
> I tried to run some tests on EMR in yarn cluster mode.
>
> I have a cluster with 16 cores (8 processors with 2 threads each). If I run
> one job (using 5 cores) it takes 90 seconds; if I run 2 jobs simultaneously,
> both finish in 170 seconds. If I run 3 jobs simultaneously, all three finish
> in 240 seconds.
>
> If I run 6 jobs, I expect the first 3 jobs to finish simultaneously in 240
> seconds, and the next 3 jobs to finish 480 seconds from cluster start time.
> But that doesn't happen. My first job finished after 120 seconds, the second
> finished after 180 seconds, the third finished after 240 seconds, the fourth
> and the fifth finished simultaneously after 360 seconds, and the last
> finished after 400 seconds.
>
> I expected it to run in a FIFO mode, but that doesn't happen. It seems to be
> a combination of FIFO and FAIR.
>
> Is this the correct behavior of spark?
>
> Thank you!
>
> 2017-02-08 9:29 GMT+02:00 Gourav Sengupta :
>
> Hi,
>
> Michael's answer will solve the problem in case you are using only a
> SQL-based solution.
>
> Otherwise please refer to the wonderful details mentioned here:
> https://spark.apache.org/docs/latest/job-scheduling.html. With EMR 5.3.0
> released, Spark 2.1.0 is available in AWS.
>
> (note that there is an issue with using zeppelin in it and I have raised
> it as an issue to AWS and they are looking into it now)
>
> Regards,
> Gourav Sengupta
>
> On Tue, Feb 7, 2017 at 10:37 PM, Michael Segel 
> wrote:
>
> Why couldn’t you use the spark thrift server?
>
> On Feb 7, 2017, at 1:28 PM, Cosmin Posteuca 
> wrote:
>
> answer for Gourav Sengupta
>
> I want to use the same spark application because I want it to work as a FIFO
> scheduler. My problem is that I have many jobs (not so big) and if I run an
> application for every job my cluster will split resources as a FAIR
> scheduler (that's what I observe, maybe I'm wrong), and there is the
> possibility of creating a bottleneck effect. The start time isn't a problem
> for me, because it isn't a real-time application.
>
> I need a business solution, that's the reason why I can't use code from
> github.
>
> Thanks!
>
> 2017-02-07 19:55 GMT+02:00 Gourav Sengupta :
>
> Hi,
>
> May I ask the reason for using the same spark application? Is it because
> of the time it takes in order to start a spark context?
>
> On another note, you may want to look at the number of contributors in a
> github repo before choosing a solution.
>
> Regards,
> Gourav
>
> On Tue, Feb 7, 2017 at 5:26 PM, vincent gromakowski
>  wrote:
>
> Spark jobserver or Livy server are the best options for a pure technical API.
> If you want to publish a business API you will probably have to build your
> own app like the one I wrote a year ago:
> https://github.com/elppc/akka-spark-experiments
> It combines Akka actors and a shared Spark context to serve concurrent
> subsecond jobs.
>
> 2017-02-07 15:28 GMT+01:00 ayan guha :
>
> I think you are looking for Livy or spark jobserver.
>
> On Wed, 8 Feb 2017 at 12:37 am, Cosmin Posteuca 
> wrote:
>
> I want to run different jobs on demand with the same spark context, but I
> don't know how exactly I can do this.
>
> I try to get the current context, but it seems to create a new spark
> context (with new executors).
>
> I call spark-submit to add new jobs.
>
> I run the code on Amazon EMR (3 instances, 4 cores & 16GB RAM / instance),
> with yarn as the resource manager.
>
> My code:
>
> val sparkContext = SparkContext.getOrCreate()
> val content = 1 to 4
> val result = sparkContext.parallelize(content, 5)
> result.map(value => value.toString).foreach(loop)
>
> def loop(x: String): Unit = {
>    for (a <- 1 to 3000) {
>    }
> }
>
> spark-submit:
>
> spark-submit --executor-cores 1 \
>  --executor-memory 1g \
>  --driver-memory 1g \
>  --master yarn \
>  --deploy-mode cluster \
>  --conf spark.dynamicAllocation.enabled=true \
>  --conf spark.shuffle.service.enabled=true \
>  --conf spark.dynamicAllocation.minExecutors=1 \
>  --conf 
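
For anyone finding this thread later, a minimal sketch of the shared-context
approach discussed above (names and numbers are illustrative only):
spark.scheduler.mode governs how concurrently submitted jobs share executors
inside one application (FIFO by default, FAIR with a pool file), while
separate spark-submit applications are scheduled by YARN's own queue
scheduler, which matches the mixed FIFO/FAIR behavior observed above.

import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object SharedContextSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("shared-context-sketch")
      .set("spark.scheduler.mode", "FIFO")   // default; within-app setting only

    val sc = SparkContext.getOrCreate(conf)

    // Two on-demand "jobs" submitted from different threads against the same context.
    val jobA = Future { sc.parallelize(1 to 1000000).map(_ * 2).count() }
    val jobB = Future { sc.parallelize(1 to 1000000).filter(_ % 3 == 0).count() }

    println(Await.result(jobA, 10.minutes))
    println(Await.result(jobB, 10.minutes))
    sc.stop()
  }
}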

Re: Driver hung and happened out of memory while writing to console progress bar

2017-02-10 Thread Ryan Blue
This isn't related to the progress bar; it just happened while in that
section of code. Something else is taking memory in the driver, usually a
broadcast table or some other operation that requires a lot of memory and
happens on the driver.

You should check your driver memory settings and the query plan (if this
was SparkSQL) for this stage to investigate further.
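
For example, something along these lines (the query and the values are
placeholders, not recommendations):

// Inspect the plan of whatever SQL feeds the stage in question.
val suspectQuery = "SELECT * FROM some_table"   // stand-in for the stage-172 query
spark.sql(suspectQuery).explain(true)           // look for broadcast exchanges / large plans

// At submit time, driver-side limits can be raised, e.g.:
//   spark-submit --driver-memory 8g --conf spark.driver.maxResultSize=2g ...
//   (example values only)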

rb

On Thu, Feb 9, 2017 at 8:41 PM, John Fang 
wrote:

> the spark version is 2.1.0
>
> --
> From: 方孝健(玄弟) 
> Sent: Friday, February 10, 2017, 12:35
> To: spark-dev ; spark-user 
> Subject: Driver hung and happened out of memory while writing to console
> progress bar
>
> [Stage 172:==>    (10328 + 93) / 16144]
> [Stage 172:==>    (10329 + 93) / 16144]
> [Stage 172:==>    (10330 + 93) / 16144]
> ...
> [Stage 172:==>    (10367 + 93) / 16144]
> (the same console progress-bar line repeated for each update; the task
> counter advances from 10328 to 10367 of 16144, truncated in the original
> message)

SQL warehouse dir

2017-02-10 Thread Joseph Naegele

Hi all,

I've read the docs for Spark SQL 2.1.0 but I'm still having issues with the 
warehouse and related details.

I'm not using Hive proper, so my hive-site.xml consists only of:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/mnt/data/spark/metastore_db;create=true</value>
</property>

I've set "spark.sql.warehouse.dir" in my "spark-defaults.conf", however the 
location in my catalog doesn't match:

scala> spark.conf.get("spark.sql.warehouse.dir")
res8: String = file://mnt/data/spark/warehouse

scala> spark.conf.get("hive.metastore.warehouse.dir")
res9: String = file://mnt/data/spark/warehouse

scala> spark.catalog.listDatabases.show(false)
+-------+---------------------+-----------------------------+
|name   |description          |locationUri                  |
+-------+---------------------+-----------------------------+
|default|Default Hive database|file:/home/me/spark-warehouse|
+-------+---------------------+-----------------------------+

I've also tried setting "spark.sql.warehouse.dir" to a valid HDFS path to no 
avail.

My application loads both ORC tables and AVRO files (using spark-avro) from 
HDFS.
When I load a table using spark.sql("select * from orc.`my-table-in-hdfs`"), I 
see WARN ObjectStore: Failed to get database orc, returning NoSuchObjectException.
When I load an AVRO file from HDFS using spark.read.avro(filename), I see WARN
DataSource: Error while looking for metadata directory.

Any ideas as to what I'm doing wrong?
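
For reference, my understanding is that spark.sql.warehouse.dir is read when
the SparkSession is first created, so a minimal sketch of setting it up front
looks like this (the HDFS path is just an example):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("warehouse-dir-example")
  .config("spark.sql.warehouse.dir", "hdfs:///user/me/spark-warehouse")
  .enableHiveSupport()
  .getOrCreate()

// Note: the default database's locationUri is recorded in the metastore when
// the metastore is first created, so an existing metastore_db may keep an old
// path even after the configuration changes.
spark.catalog.listDatabases.show(false)

If that is right, a Derby metastore_db that was first created with the old
location might also explain why the default database still points at
file:/home/me/spark-warehouse.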

--
Joe Naegele
Grier Forensics
410.220.0968





Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-10 Thread nguyen duc Tuan
hi Das,
In general, I will apply them to larger datasets, so I want to use LSH,
which is more scalable than the approaches you suggested. Have you
tried LSH in Spark 2.1.0 before? If yes, how do you set the
parameters/configuration to make it work?
Thanks.

2017-02-10 19:21 GMT+07:00 Debasish Das :

> If it is 7m rows and 700k features (or say 1m features) brute force row
> similarity will run fine as well...check out spark-4823...you can compare
> quality with approximate variant...
> On Feb 9, 2017 2:55 AM, "nguyen duc Tuan"  wrote:
>
>> Hi everyone,
>> Since Spark 2.1.0 introduces LSH (http://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing),
>> we want to use LSH to find approximate nearest neighbors. Basically, we have
>> a dataset with about 7M rows. We want to use cosine distance to measure the
>> similarity between items, so we use *RandomSignProjectionLSH* (
>> https://gist.github.com/tuan3w/c968e56ea8ef135096eeedb08af097db) instead
>> of *BucketedRandomProjectionLSH*. I have tried tuning some configurations such
>> as serialization, memory fraction, executor memory (~6G), number of
>> executors (~20), memory overhead, ..., but nothing works. I often get the error
>> "java.lang.OutOfMemoryError: Java heap space" while running. I know that
>> this implementation was done by an engineer at Uber, but I don't know the
>> right configuration to run the algorithm at scale. Does it need very big
>> memory to run?
>>
>> Any help would be appreciated.
>> Thanks
>>
>


Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-10 Thread Debasish Das
If it is 7m rows and 700k features (or say 1m features) brute force row
similarity will run fine as well...check out spark-4823...you can compare
quality with approximate variant...
On Feb 9, 2017 2:55 AM, "nguyen duc Tuan"  wrote:

> Hi everyone,
> Since Spark 2.1.0 introduces LSH (http://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing),
> we want to use LSH to find approximate nearest neighbors. Basically, we have
> a dataset with about 7M rows. We want to use cosine distance to measure the
> similarity between items, so we use *RandomSignProjectionLSH* (
> https://gist.github.com/tuan3w/c968e56ea8ef135096eeedb08af097db) instead
> of *BucketedRandomProjectionLSH*. I have tried tuning some configurations such
> as serialization, memory fraction, executor memory (~6G), number of
> executors (~20), memory overhead, ..., but nothing works. I often get the error
> "java.lang.OutOfMemoryError: Java heap space" while running. I know that
> this implementation was done by an engineer at Uber, but I don't know the
> right configuration to run the algorithm at scale. Does it need very big
> memory to run?
>
> Any help would be appreciated.
> Thanks
>


HDFS Shell tool

2017-02-10 Thread Vitásek , Ladislav
Hello Spark fans,
I would like to tell you about a tool we want to share with the big data
community. I think it can also be handy for Spark users.

We created a new utility - HDFS Shell - to work with HDFS data more easily.

https://github.com/avast/hdfs-shell

*Feature highlights*
- The HDFS DFS command initiates a JVM for each command call; HDFS Shell does
it only once, which means a great speed enhancement when you need to work with
HDFS more often
- Commands can be used in a short way - e.g. *hdfs dfs -ls /* and *ls /* both
work
- *HDFS path completion using the TAB key*
- You can easily add any other HDFS manipulation function
- There is a command history persisted in a history log
(~/.hdfs-shell/hdfs-shell.log)
- Support for relative directories and the commands *cd* and *pwd*
- It can also be launched as a daemon (using UNIX domain sockets)
- 100% Java, open source

Your suggestions are welcome.

-L. Vitasek aka Vity


Write JavaDStream to Kafka (how?)

2017-02-10 Thread Gutwein, Sebastian
Hi,


I'm new to Spark Streaming and want to run some end-to-end tests with Spark
and Kafka.

My program is running, but nothing arrives at the Kafka topic. Can someone
please help me?

Where is my mistake? Does someone have a running example of writing a DStream
to Kafka 0.10.1.0?


The program looks as follows:

import kafka.Kafka;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.*;
import org.apache.spark.rdd.RDD;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import scala.Int;
import scala.Tuple2;

import java.util.*;
import java.util.regex.Pattern;

/**
 * Consumes messages from one or more topics in Kafka and does wordcount.
 *
 * Usage: JavaKafkaWordCount <zkQuorum> <group> <topics> <numThreads>
 *   <zkQuorum> is a list of one or more zookeeper servers that make quorum
 *   <group> is the name of kafka consumer group
 *   <topics> is a list of one or more kafka topics to consume from
 *   <numThreads> is the number of threads the kafka consumer should use
 *
 * To run this example:
 *   `$ bin/run-example org.apache.spark.examples.streaming.JavaKafkaWordCount zoo01,zoo02, \
 *    zoo03 my-consumer-group topic1,topic2 1`
 */

public final class JavaKafkaWordCountTest {
  private static final Pattern SPACE = Pattern.compile(" ");

  private JavaKafkaWordCountTest() {
  }

  public static void main(String[] args) throws Exception {
if (args.length < 4) {
  System.err.println("Usage: JavaKafkaWordCount <zkQuorum> <group> <topics> <numThreads>");
  System.exit(1);
}

SparkConf sparkConf = new SparkConf().setAppName("GutweinKafkaWordCount");
// Create the context with 2 seconds batch size
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));

int numThreads = Integer.parseInt(args[3]);
Map<String, Integer> topicMap = new HashMap<>();
String[] topics = args[2].split(",");
for (String topic: topics) {
  topicMap.put(topic, numThreads);
}

final JavaPairReceiverInputDStream<String, String> messages =
KafkaUtils.createStream(jssc, args[0], args[1], topicMap);

JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
  @Override
  public String call(Tuple2<String, String> tuple2) {
    return tuple2._2();
  }
});

JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
  @Override
  public Iterator<String> call(String x) {
    return Arrays.asList(SPACE.split(x)).iterator();
  }
});

JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
  new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String s) {
      return new Tuple2<>(s, 1);
    }
  }).reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer i1, Integer i2) {
      return i1 + i2;
    }
  });

final KafkaWriter writer = new KafkaWriter("localhost:9081");

wordCounts.foreachRDD(new VoidFunction<JavaPairRDD<String, Integer>>() {
  @Override
  public void call(JavaPairRDD<String, Integer> stringIntegerJavaPairRDD) throws Exception {
    writer.writeToTopic("output", stringIntegerJavaPairRDD.toString());
  }
});

wordCounts.print();
jssc.start();
jssc.awaitTermination();
  }

  public static class KafkaWriter {
Properties props = new Properties();
KafkaProducer<String, String> producer;

// Constructor
KafkaWriter(String bootstrap_server){
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap_server);
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
  producer = new KafkaProducer<>(props);
}


private void writeToTopic(String topicName, String message){
  ProducerRecord<String, String> record = new ProducerRecord<>(topicName, message);
  producer.send(record);

}

  }

}
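
For comparison, one structure I have seen recommended is to send the
individual (word, count) elements and create the producer per partition on
the executors, instead of passing stringIntegerJavaPairRDD.toString() (which
is the RDD object's string representation, not its contents). Sketched in
Scala for brevity, assuming a DStream[(String, Int)] named wordCounts
analogous to the JavaPairDStream above; broker and topic are the same example
values:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // One producer per partition, created on the executor.
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9081")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    records.foreach { case (word, count) =>
      producer.send(new ProducerRecord[String, String]("output", word, count.toString))
    }
    producer.close()   // close() flushes buffered records before returning
  }
}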



Add hive-site.xml at runtime

2017-02-10 Thread Shivam Sharma
Hi,

I have multiple Hive configurations (hive-site.xml), and because of that I am
not able to put any single Hive configuration in the Spark *conf* directory.
I want to add the configuration file at the start of any *spark-submit* or
*spark-shell*. The conf file is huge, so *--conf* is not an option for me.
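
Two options I am considering (paths and class names below are placeholders,
and I am not sure they fit every deploy mode): shipping the file with --files
so it lands on the application classpath, or putting a directory that contains
the right hive-site.xml on the driver classpath.

# 1) Ship a specific hive-site.xml with the application (with YARN it is
#    placed in the container working directory, which is on the classpath):
spark-submit --files /path/to/envA/hive-site.xml --class com.example.MyApp myapp.jar

# 2) Or prepend the directory containing the desired hive-site.xml:
spark-shell --driver-class-path /path/to/envA/conf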

Thanks

-- 
Shivam Sharma


Re: Add hive-site.xml at runtime

2017-02-10 Thread Shivam Sharma
Did anybody get the above mail?

Thanks

On Fri, Feb 10, 2017 at 11:51 AM, Shivam Sharma <28shivamsha...@gmail.com>
wrote:

> Hi,
>
> I have multiple Hive configurations (hive-site.xml), and because of that I am
> not able to put any single Hive configuration in the Spark *conf* directory.
> I want to add the configuration file at the start of any *spark-submit* or
> *spark-shell*. The conf file is huge, so *--conf* is not an option for me.
>
> Thanks
>
> --
> Shivam Sharma
>
>


-- 
Shivam Sharma