How long should logistic regression take on this data?

2016-05-05 Thread Bibudh Lahiri
Hi,
  I am doing the following exercise: I have 100 million labeled records
(total 2.7 GB data) in LibSVM (sparse) format, split across 200 files on
HDFS (each file ~14 MB), so each file has about 500K records. Only 50K of
these 100 million are labeled as "positive", and the rest are all
"negative". I am taking a sample of 50K from the "negative" set, merging it
with the 50K positive, and splitting it into 50% training and 50% test set.
I am training an Elastic Net logistic regression (without regularization)
on the training dataset, testing its performance on the 50K test
datapoints, and then applying the model on the rest of the data (100
million - 100K) to find the class-conditional probabilities of those
examples being positive.

  I have a 2-node cluster, one of them set up as master and both of them
workers, each node having 10 GB executor memory and the driver having 10 GB
memory. My Hadoop cluster runs on the same machines as my Spark cluster. My
Spark application is aborting after running for more than 3 hours, and it has
not even reached the logistic regression part in that time - it is still in
the sampling, filtering and merging stage. Any ballpark figure for how long
it should take? Are there some known benchmarks for logistic regression?

-- 
Bibudh Lahiri
Senior Data Scientist, Impetus Technologies
720 University Avenue, Suite 130
Los Gatos, CA 95129
http://knowthynumbers.blogspot.com/
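
For reference, a minimal sketch (Scala, MLlib in the Spark 1.x line) of the
sampling and training flow described above; the HDFS path, sample fraction and
seeds are assumptions, not the poster's actual code:

// Load the LibSVM records and build a roughly balanced 50K/50K subset.
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/libsvm/*")       // hypothetical path
val pos  = data.filter(_.label == 1.0)                               // ~50K positives
val neg  = data.filter(_.label == 0.0).sample(false, 50000.0 / 99950000.0, seed = 42)
val balanced = pos.union(neg)
val Array(train, test) = balanced.randomSplit(Array(0.5, 0.5), seed = 42)
train.cache()

// Train plain logistic regression and score the held-out half.
val model  = new LogisticRegressionWithLBFGS().setNumClasses(2).run(train)
val scored = test.map(p => (model.predict(p.features), p.label))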


Re: Writing output of key-value Pair RDD

2016-05-05 Thread Afshartous, Nick

Answering my own question.


I filtered out the keys from the output file by overriding


  MultipleOutputFormat.generateActualKey


to return the empty string.

--

Nick


// Assumes the old org.apache.hadoop.mapred API used by saveAsHadoopFile.
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat<String, String> {

    // Name each output file after its key.
    @Override
    protected String generateFileNameForKeyValue(String key, String value, String name) {
        return key;
    }

    // Drop the key from the written record so only the value appears.
    @Override
    protected String generateActualKey(String key, String value) {
        return "";
    }
}


From: Afshartous, Nick 
Sent: Thursday, May 5, 2016 3:35:17 PM
To: Nicholas Chammas; user@spark.apache.org
Subject: Re: Writing output of key-value Pair RDD



Thanks, I got the example below working.  Though it writes both the keys and 
values to the output file.

Is there any way to write just the values ?

--

Nick


String[] strings = { "Abcd", "Azlksd", "whhd", "wasc", "aDxa" };

sc.parallelize(Arrays.asList(strings))

.mapToPair(pairFunction)
.saveAsHadoopFile("s3://...", String.class, String.class, 
RDDMultipleTextOutputFormat.class);



From: Nicholas Chammas 
Sent: Wednesday, May 4, 2016 4:21:12 PM
To: Afshartous, Nick; user@spark.apache.org
Subject: Re: Writing output of key-value Pair RDD

You're looking for this discussion: http://stackoverflow.com/q/23995040/877069

Also, a simpler alternative with DataFrames: 
https://github.com/apache/spark/pull/8375#issuecomment-202458325

On Wed, May 4, 2016 at 4:09 PM Afshartous, Nick 
> wrote:

Hi,


Is there any way to write out to S3 the values of a key-value Pair RDD ?


I'd like each value of a pair to be written to its own file where the file name 
corresponds to the key name.


Thanks,

--

Nick


Re: Missing data in Kafka Consumer

2016-05-05 Thread Cody Koeninger
Does that code even compile?  I'm assuming eventLogJson.foreach is
supposed to be eventLogJson.foreachRDD ?
I'm also confused as to why you're repartitioning to 1 partition.

Is your streaming job lagging behind (especially given that you're
basically single-threading it by repartitioning to 1 partition)?

Have you looked for any error logs or failed tasks during the time you
noticed missing messages?

Have you verified that you aren't attempting to overwrite hdfs paths?


On Thu, May 5, 2016 at 2:09 PM, Jerry Wong  wrote:
> Hi Cody,
>
> Thank you for the quick response to my question. I have pasted the main part of the code:
>
> val sparkConf = new SparkConf().setAppName("KafkaSparkConsumer")
>
> sparkConf.set("spark.cassandra.connection.host", "...")
> sparkConf.set("spark.broadcast.factory",
> "org.apache.spark.broadcast.HttpBroadcastFactory")
> sparkConf.set("spark.cores.max", args(0))
> sparkConf.set("spark.executor.memory", args(1))
> val kafka_broker = args(2)
> val kafka_topic = args(3)
> val hdfs_path = args(4)
> val ssc = new StreamingContext(sparkConf, 2)
> val topicsSet = Set[String](kafka_topic)
> val kafkaParams = Map[String, String]("metadata.broker.list" →
> kafka_broker)
> val messages = KafkaUtils.createDirectStream[String, String, StringDecoder,
> StringDecoder](ssc, kafkaParams, topicsSet)
> val lines = messages.repartition(1).map({ case (w, c) ⇒ (c)})
> val eventLogJson  = lines.filter(line => line.contains("eventType"))
>
> val eventlog =eventLogJson.foreach(json => {
> if(!json.isEmpty()){
> json.saveAsTextFile(hdfs_path+"/eventlogs/"+getTimeFormatToFile())
> }
> })
> ssc.start()
> ssc.awaitTermination()
>}
>def getTimeFormatToFile(): String = {
> val dateFormat =new SimpleDateFormat("_MM_dd_HH_mm_ss")
>val dt = new Date()
> val cg= new GregorianCalendar()
> cg.setTime(dt);
> return dateFormat.format(cg.getTime())
>   }
>
> Is any other information needed?
>
> Thanks!
>
> On Thu, May 5, 2016 at 12:34 PM, Cody Koeninger  wrote:
>>
>> That's not much information to go on.  Any relevant code sample or log
>> messages?
>>
>> On Thu, May 5, 2016 at 11:18 AM, Jerry  wrote:
>> > Hi,
>> >
>> > Does anybody give me an idea why the data is lost at the Kafka Consumer
>> > side? I use Kafka 0.8.2 and Spark (streaming) version is 1.5.2.
>> > Sometimes, I
>> > found out I could not receive the same number of data with Kafka
>> > producer.
>> > Exp) I sent 1000 data to Kafka Broker via Kafka Producer and confirmed
>> > the
>> > same number in the Broker. But when I checked either HDFS or Cassandra,
>> > the
>> > number is just 363. The data is not always lost, just sometimes...
>> > That's
>> > weird and annoying to me.
>> > Can anybody give me some reasons?
>> >
>> > Thanks!
>> > Jerry
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/Missing-data-in-Kafka-Consumer-tp26887.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
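
A minimal sketch (Scala) of the per-batch save hinted at above, assuming the
same hypothetical hdfs_path variable as in Jerry's code; foreachRDD plus the
batch time gives every micro-batch its own output directory, so nothing is
overwritten:

eventLogJson.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    // One directory per batch, keyed by the batch timestamp.
    rdd.saveAsTextFile(s"$hdfs_path/eventlogs/${time.milliseconds}")
  }
}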



Re: Missing data in Kafka Consumer

2016-05-05 Thread Cody Koeninger
That's not much information to go on.  Any relevant code sample or log messages?

On Thu, May 5, 2016 at 11:18 AM, Jerry  wrote:
> Hi,
>
> Does anybody give me an idea why the data is lost at the Kafka Consumer
> side? I use Kafka 0.8.2 and Spark (streaming) version is 1.5.2. Sometimes, I
> found out I could not receive the same number of data with Kafka producer.
> Exp) I sent 1000 data to Kafka Broker via Kafka Producer and confirmed the
> same number in the Broker. But when I checked either HDFS or Cassandra, the
> number is just 363. The data is not always lost, just sometimes... That's
> weird and annoying to me.
> Can anybody give me some reasons?
>
> Thanks!
> Jerry
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Missing-data-in-Kafka-Consumer-tp26887.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Content-based Recommendation Engine

2016-05-05 Thread Sree Eedupuganti
Can anyone share code for a content-based recommendation engine that makes
recommendations to the user based on e-mail subject?

-- 
Best Regards,
Sreeharsha Eedupuganti
Data Engineer
innData Analytics Private Limited


Re: Spark Streaming, Batch interval, Windows length and Sliding Interval settings

2016-05-05 Thread Mich Talebzadeh
Thanks Ryan for the correction. Posted to the wrong user list :(



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 5 May 2016 at 19:35, Ryan Harris  wrote:

> This is really outside of the scope of Hive and would probably be better
> addressed by the Spark community, however I can say that this very much
> depends on your use case
>
>
>
> Take a look at this discussion if you haven't already:
>
> https://groups.google.com/forum/embed/#!topic/spark-users/GQoxJHAAtX4
>
>
>
> Generally speaking, the larger the batch window, the better the overall
> performance, but the streaming data output will be updated less
> frequently. You will likely run into problems setting your batch window
> < 0.5 sec, and/or when the batch window is shorter than the time it takes to
> process a batch.
>
>
>
> Beyond that, the window length and sliding interval need to be multiples
> of the batch window, but will depend entirely on your reporting
> requirements.
>
>
>
> it would be perfectly reasonable to have
>
> batch window = 30 secs
>
> window length = 1 hour
>
> sliding interval = 5 mins
>
>
>
> In that case, you'd be creating an output every 5 mins, aggregating data
> that you were collecting every 30 seconds over a previous 1 hour period of
> time...
>
>
>
> could you set the batch window to 5 mins?  Possibly, depending on the data
> source, but perhaps you are already using that source on a more frequent
> basis elsewhere... or maybe you only have a 1 min buffer on the source
> data... lots of possibilities, which is why there is the flexibility and no
> hard/fast rule.
>
>
>
> If you were trying to create continuously streaming output as fast as
> possible, then you would probably (almost always) be setting your sliding
> interval = batch window and then shrinking the batch window as short as
> possible.
>
>
>
> More documentation here:
>
>
> https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/windows.html
>
>
>
>
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* Thursday, May 05, 2016 4:26 AM
> *To:* user
> *Subject:* Re: Spark Streaming, Batch interval, Windows length and
> Sliding Interval settings
>
>
>
> Any ideas/experience on this?
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>
> On 4 May 2016 at 21:45, Mich Talebzadeh  wrote:
>
> Hi,
>
>
>
> Just wanted opinions on this.
>
>
>
> In Spark streaming the parameter
>
>
>
> val ssc = new StreamingContext(sparkConf, Seconds(n))
>
>
>
> defines the batch or sample interval for the incoming streams
>
>
>
> In addition there is windows Length
>
>
>
> // window length - The duration of the window below that must be multiple
> of batch interval n in = > StreamingContext(sparkConf, Seconds(n))
>
>
>
> val windowLength = L
>
>
>
> And finally the sliding interval
>
> // sliding interval - The interval at which the window operation is
> performed
>
>
>
> val slidingInterval = I
>
>
>
> OK, so as given, the windowLength L must be a multiple of n, and the
> slidingInterval has to be consistent with it to ensure that we can capture
> the head and tail of the window.
>
>
>
> So as a heuristic approach for a batch interval of say 10 seconds, I put
> the windows length at 3 times  that = 30 seconds and make the
> slidinginterval = batch interval = 10.
>
>
>
> Obviously these are subjective depending on what is being measured.
> However, I believe having slidinginterval = batch interval makes sense?
>
>
>
> Appreciate any views on this.
>
>
>
> Thanks,
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>
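
A minimal sketch (Scala) of the combination Ryan describes above (batch = 30
seconds, window length = 1 hour, sliding interval = 5 minutes); the socket
source and sparkConf are placeholders, not part of the original discussion:

import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val ssc = new StreamingContext(sparkConf, Seconds(30))        // batch interval
val events = ssc.socketTextStream("somehost", 9999)           // hypothetical source
val hourly = events.window(Minutes(60), Minutes(5))           // window length, slide
hourly.count().print()                                        // one output every 5 minutes
ssc.start()
ssc.awaitTermination()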


Accessing JSON array in Spark SQL

2016-05-05 Thread Xinh Huynh
Hi,

I am having trouble accessing an array element in JSON data with a
dataframe. Here is the schema:

val json1 = """{"f1":"1", "f1a":[{"f2":"2"}]}"""
val rdd1 = sc.parallelize(List(json1))
val df1 = sqlContext.read.json(rdd1)
df1.printSchema()

root
 |-- f1: string (nullable = true)
 |-- f1a: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- f2: string (nullable = true)

I would expect to be able to select the first element of "f1a" this way:
df1.select("f1a[0]").show()

org.apache.spark.sql.AnalysisException: cannot resolve 'f1a[0]' given input
columns f1, f1a;

This is with Spark 1.6.0.

Please help. A follow-up question is: can I access arbitrary levels of
nested JSON array of struct of array of struct?

Thanks,
Xinh


Re: Accessing JSON array in Spark SQL

2016-05-05 Thread Michael Armbrust
Use df.selectExpr to evaluate complex expressions (instead of just column
names).
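
As a quick, hedged sketch against Xinh's schema:

df1.selectExpr("f1a[0].f2").show()
// or the equivalent Column API:
df1.select(df1("f1a").getItem(0).getField("f2")).show()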

On Thu, May 5, 2016 at 11:53 AM, Xinh Huynh  wrote:

> Hi,
>
> I am having trouble accessing an array element in JSON data with a
> dataframe. Here is the schema:
>
> val json1 = """{"f1":"1", "f1a":[{"f2":"2"}]}"""
> val rdd1 = sc.parallelize(List(json1))
> val df1 = sqlContext.read.json(rdd1)
> df1.printSchema()
>
> root
>  |-- f1: string (nullable = true)
>  |-- f1a: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- f2: string (nullable = true)
>
> I would expect to be able to select the first element of "f1a" this way:
> df1.select("f1a[0]").show()
>
> org.apache.spark.sql.AnalysisException: cannot resolve 'f1a[0]' given
> input columns f1, f1a;
>
> This is with Spark 1.6.0.
>
> Please help. A follow-up question is: can I access arbitrary levels of
> nested JSON array of struct of array of struct?
>
> Thanks,
> Xinh
>


mesos cluster mode

2016-05-05 Thread satish saley
Hi,
Spark documentation says that "cluster mode is currently not supported for
Mesos clusters."But below we can see mesos example with cluster mode. I
don't have mesos cluster to try it out. Which one is true? Shall I
interpret it as "cluster mode is currently not supported for Mesos clusters*
for Python applications*" ?

"Alternatively, if your application is submitted from a machine far from
the worker machines (e.g. locally on your laptop), it is common to use
cluster mode to minimize network latency between the drivers and the
executors. Note that cluster mode is currently not supported for Mesos
clusters. Currently only YARN supports cluster mode for Python
applications."



# Run on a Mesos cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  http://path/to/examples.jar \
  1000


Re: Missing data in Kafka Consumer

2016-05-05 Thread Jerry
Hi David,

Thank you for your response.
Before inserting into Cassandra, I had checked that the data was already
missing in HDFS (my second step is to load data from HDFS and then insert it
into Cassandra).

Can you send me the link relating this bug of 0.8.2?

Thank you!
Jerry

On Thu, May 5, 2016 at 12:38 PM, david.lewis [via Apache Spark User List] <
ml-node+s1001560n26888...@n3.nabble.com> wrote:

> It's possible Kafka is throwing an exception and erroneously returning
> acks (there is a known bug in 0.8.2 that I encountered when my harddisk
> that was keeping log files and holding the temporary snappy library was
> full).
> It's also possible that your messages are not unique when they are put
> into Cassandra. Are all of your messages unique in their primary keys in
> your Cassandra table?
>
> On Thu, May 5, 2016 at 10:18 AM, Jerry [via Apache Spark User List] <[hidden
> email] > wrote:
>
>> Hi,
>>
>> Does anybody give me an idea why the data is lost at the Kafka Consumer
>> side? I use Kafka 0.8.2 and Spark (streaming) version is 1.5.2. Sometimes,
>> I found out I could not receive the same number of data with Kafka
>> producer. Exp) I sent 1000 data to Kafka Broker via Kafka Producer and
>> confirmed the same number in the Broker. But when I checked either HDFS or
>> Cassandra, the number is just 363. The data is not always lost, just
>> sometimes... That's weird and annoying to me.
>> Can anybody give me some reasons?
>>
>> Thanks!
>> Jerry
>>
>>
>
>
>
> --
> -David Lewis
>
> *Blyncsy, Inc.* |www.Blyncsy.com  
>
>
>
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Missing-data-in-Kafka-Consumer-tp26887p26890.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Content-based Recommendation Engine

2016-05-05 Thread Chan Yi Sheng(Eason)
Dear Sree,

here's a simple content-based recommendation engine I built using LDA.

Demo site: http://54.183.251.139:8080/
Github link: https://github.com/easonchan1213/LDA_RecEngine


Cheers,
Chan Yi Sheng

2016-05-05 17:34 GMT+01:00 Sree Eedupuganti :

> Can anyone share code for a content-based recommendation engine that makes
> recommendations to the user based on e-mail subject?
>
> --
> Best Regards,
> Sreeharsha Eedupuganti
> Data Engineer
> innData Analytics Private Limited
>



-- 

詹益昇 Chan Yi Sheng (Eason)
Mobile (Taiwan): +886-920-209-324
Mobile (UK): +44 (0)7933 400731
E-mail: easonchan1...@gmail.com
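
A hedged sketch (Scala, MLlib 1.x) of the content-based part of this, separate
from the LDA engine linked above: build TF-IDF vectors over e-mail subjects,
which can then be ranked by cosine similarity against the subjects a user has
already engaged with. The input path and tokenization are assumptions:

import org.apache.spark.mllib.feature.{HashingTF, IDF}

val subjects = sc.textFile("hdfs:///mail/subjects.txt")       // one subject per line, hypothetical path
val terms    = subjects.map(_.toLowerCase.split("\\s+").toSeq)

val tf = new HashingTF(1 << 18).transform(terms)
tf.cache()
val tfidf = new IDF(minDocFreq = 2).fit(tf).transform(tf)     // RDD[Vector], one per subject
// Recommend by comparing these vectors (e.g. cosine similarity) against the
// vectors of the user's previously read subjects.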


Re: groupBy and store in parquet

2016-05-05 Thread Michal Vince

Hi Xinh

For (1) the biggest problem is those null columns, e.g. the DF will have 
~1000 columns, so every partition of that DF will have ~1000 columns; one 
of the partitioned columns can have 996 null columns, which is a big waste 
of space (in my case more than 80% on average).


for (2) I can`t really change anything as the source belongs to the 3rd 
party



Miso


On 05/04/2016 05:21 PM, Xinh Huynh wrote:

Hi Michal,

For (1), would it be possible to partitionBy two columns to reduce the 
size? Something like partitionBy("event_type", "date").


For (2), is there a way to separate the different event types 
upstream, like on different Kafka topics, and then process them 
separately?


Xinh

On Wed, May 4, 2016 at 7:47 AM, Michal Vince > wrote:


Hi guys

I`m trying to store kafka stream with ~5k events/s as efficiently
as possible in parquet format to hdfs.

I can`t make any changes to kafka (belongs to 3rd party)


Events in kafka are in json format, but the problem is there are
many different event types (from different subsystems with
different number of fields, different size etc..) so it doesn`t
make any sense to store them in the same file


I was trying to read data to DF and then repartition it by
event_type and store


events.write.partitionBy("event_type").format("parquet").mode(org.apache.spark.sql.SaveMode.Append).save(tmpFolder)

which is quite fast but have 2 drawbacks that I`m aware of

1. output folder has only one partition which can be huge

2. all DFs created like this share the same schema, so even dfs
with few fields have tons of null fields


My second try is bit naive and really really slow (you can see why
in code) - filter DF by event type and store them temporarily as
json (to get rid of null fields)

val event_types = events.select($"event_type").distinct().collect() // get event_types in this batch

for (row <- event_types) {
  val currDF  = events.filter($"event_type" === row.get(0))
  val tmpPath = tmpFolder + row.get(0)
  currDF.write.format("json").mode(org.apache.spark.sql.SaveMode.Append).save(tmpPath)
  sqlContext.read.json(tmpPath).write.format("parquet").save(basePath)
}
hdfs.delete(new Path(tmpFolder), true)


Do you have any suggestions for any better solution to this?

thanks







Re: Mllib using model to predict probability

2016-05-05 Thread ndjido
You can use the BinaryClassificationEvaluator class to get both predicted
classes (0/1) and probabilities. Check the following Spark doc:
https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html


Cheers,
Ardo 

Sent from my iPhone

> On 05 May 2016, at 07:59, colin  wrote:
> 
> In 2-class problems, when I use SVM, RondomForest models to do
> classifications, they predict "0" or "1".
> And when I use ROC to evaluate the model, sometimes I need a probability
> that a record belongs to "0" or "1".
> In scikit-learn, every model can do "predict" and "predict_prob", which the
> last one can ouput the probability.
> I find the document, and didn't found this function in MLLIB. 
> Does mllib has this function?
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-using-model-to-predict-probability-tp26886.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
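
A hedged side note (Scala, MLlib 1.5/1.6) on the original question: for
LogisticRegressionModel (and SVMModel) the 0/1 output comes from a threshold,
and clearing it makes predict() return the raw score - the class-1 probability
for logistic regression, the margin for SVM. trainingData and testData are
assumed to be RDD[LabeledPoint]:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(trainingData)
model.clearThreshold()                              // predict() now returns probabilities
val probsAndLabels = testData.map(p => (model.predict(p.features), p.label))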



Mllib using model to predict probability

2016-05-05 Thread colin
In 2-class problems, when I use SVM, RondomForest models to do
classifications, they predict "0" or "1".
And when I use ROC to evaluate the model, sometimes I need a probability
that a record belongs to "0" or "1".
In scikit-learn, every model can do "predict" and "predict_prob", which the
last one can ouput the probability.
I find the document, and didn't found this function in MLLIB. 
Does mllib has this function?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-using-model-to-predict-probability-tp26886.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: DeepSpark: where to start

2016-05-05 Thread Mark Vervuurt
Well, you got me fooled as well ;)
Had it on my todolist to dive into this new component...

Mark

> Op 5 mei 2016 om 07:06 heeft Derek Chan  het volgende 
> geschreven:
> 
> The blog post is a April Fool's joke. Read the last line in the post:
> 
> https://databricks.com/blog/2016/04/01/unreasonable-effectiveness-of-deep-learning-on-spark.html
> 
> 
> 
>> On Thursday, May 05, 2016 10:42 AM, Joice Joy wrote:
>> I am trying to find info on deepspark. I read the article on databricks blog 
>> which doesn't mention a git repo but does say it's open source.
>> Help me find the git repo for this. I found two and not sure which one is
>> the databricks deepspark:
>> https://github.com/deepspark/deepspark
>> https://github.com/nearbydelta/deepspark
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Access S3 bucket using IAM roles

2016-05-05 Thread Jyotiska
Hi,

I am trying to access my S3 bucket. Is it possible to access the bucket and
the files inside it by using the IAM role, without supplying a secret access
key and access key id? I am able to do this in boto, where I do not pass the
secret key and key id while connecting, yet it connects using the IAM role. Is
the same possible for PySpark?

Thanks.


package for data quality in Spark 1.5.2

2016-05-05 Thread Divya Gehlot
Hi,

Is there any package or project in Spark/scala which supports Data Quality
check?
For instance checking null values , foreign key constraint

Would really appreciate ,if somebody has already done it and happy to share
or has any open source package .


Thanks,
Divya


Re: package for data quality in Spark 1.5.2

2016-05-05 Thread Mich Talebzadeh
Hi,

Spark is a query tool. It stores data in HDFS, a Hive database or anything
else, but it does not have its own generic database.

Null values and foreign key constraints belong to the domain of databases.
What exactly is the nature of your requirements? Do you want to use the Spark
tool to look at the DDL and relationships in the underlying storage layer?

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 5 May 2016 at 11:51, Divya Gehlot  wrote:

> Hi,
>
> Is there any package or project in Spark/scala which supports Data Quality
> check?
> For instance checking null values , foreign key constraint
>
> Would really appreciate ,if somebody has already done it and happy to
> share or has any open source package .
>
>
> Thanks,
> Divya
>


Fwd: package for data quality in Spark 1.5.2

2016-05-05 Thread Divya Gehlot
http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
I am looking for something similar to above solution .
-- Forwarded message --
From: "Divya Gehlot" 
Date: May 5, 2016 6:51 PM
Subject: package for data quality in Spark 1.5.2
To: "user @spark" 
Cc:

Hi,

Is there any package or project in Spark/scala which supports Data Quality
check?
For instance checking null values , foreign key constraint

Would really appreciate ,if somebody has already done it and happy to share
or has any open source package .


Thanks,
Divya
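
A minimal sketch (Scala, DataFrame API) in the spirit of the Cloudera post
linked above, not a packaged solution; df, dimDf and the column names are
assumptions:

import org.apache.spark.sql.functions._

// Count nulls per column.
val nullCounts = df.select(df.columns.map(c => sum(col(c).isNull.cast("int")).alias(c)): _*)
nullCounts.show()

// "Foreign key" check: fact rows whose key has no match in the dimension table.
val orphans = df.join(dimDf, df("customer_id") === dimDf("id"), "left_outer")
  .filter(dimDf("id").isNull)
println(orphans.count())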


Re: package for data quality in Spark 1.5.2

2016-05-05 Thread Mich Talebzadeh
ok thanks let me check it.

So your primary storage layer is HBase, with Phoenix as a tool on top.

Sounds interesting. I will get back to you on this

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 5 May 2016 at 13:26, Divya Gehlot  wrote:

>
> http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
> I am looking for something similar to above solution .
> -- Forwarded message --
> From: "Divya Gehlot" 
> Date: May 5, 2016 6:51 PM
> Subject: package for data quality in Spark 1.5.2
> To: "user @spark" 
> Cc:
>
> Hi,
>
> Is there any package or project in Spark/scala which supports Data Quality
> check?
> For instance checking null values , foreign key constraint
>
> Would really appreciate ,if somebody has already done it and happy to
> share or has any open source package .
>
>
> Thanks,
> Divya
>


Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-05 Thread sunday2000
Hi,
  I built spark 1.6.1 on Linux redhat 2.6.32-279.el6.x86_64 server, with JDK: 
jdk1.8.0_91 




------------------ Original message ------------------
From: "Divya Gehlot";
Date: Thu, May 5, 2016 10:41
To: "sunday2000" <2314476...@qq.com>;
Cc: "user"; "user";
Subject: Re: spark 1.6.1 build failure of : scala-maven-plugin



Hi,
My Javac version 


C:\Users\Divya>javac -version
javac 1.7.0_79



C:\Users\Divya>java -version
java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)



Do I need use higher version ?




Thanks,
Divya 



On 4 May 2016 at 21:31, sunday2000 <2314476...@qq.com> wrote:
Check your javac version, and update it.




------------------ Original message ------------------
From: "Divya Gehlot";
Date: Wed, May 4, 2016 11:25
To: "sunday2000" <2314476...@qq.com>;
Cc: "user"; "user";
Subject: Re: spark 1.6.1 build failure of : scala-maven-plugin



Hi ,Even I am getting the similar error 
Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile

When I tried to build Phoenix Project using maven .
Maven version : 3.3
Java version - 1.7_67
Phoenix - downloaded latest master from Git hub
If anybody find the the resolution please share.




Thanks,
Divya 


On 3 May 2016 at 10:18, sunday2000 <2314476...@qq.com> wrote:
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 14.765 s
[INFO] Finished at: 2016-05-03T10:08:46+08:00
[INFO] Final Memory: 35M/191M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-test-tags_2.10: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed -> 
[Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-test-tags_2.10: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:224)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
at 
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
at 
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:862)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:286)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:197)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.PluginExecutionException: Execution 
scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile 
failed.
at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:145)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
... 20 more
Caused by: Compile failed via zinc server
at 
sbt_inc.SbtIncrementalCompiler.zincCompile(SbtIncrementalCompiler.java:136)
at 
sbt_inc.SbtIncrementalCompiler.compile(SbtIncrementalCompiler.java:86)
at 
scala_maven.ScalaCompilerSupport.incrementalCompile(ScalaCompilerSupport.java:303)
at 

Re: DeepSpark: where to start

2016-05-05 Thread Joice Joy
What the heck, I was already beginning to like it.

On Thu, May 5, 2016 at 12:31 PM, Mark Vervuurt 
wrote:

> Well, you got me fooled as well ;)
> Had it on my todolist to dive into this new component...
>
> Mark
>
> > Op 5 mei 2016 om 07:06 heeft Derek Chan  het
> volgende geschreven:
> >
> > The blog post is a April Fool's joke. Read the last line in the post:
> >
> >
> https://databricks.com/blog/2016/04/01/unreasonable-effectiveness-of-deep-learning-on-spark.html
> >
> >
> >
> >> On Thursday, May 05, 2016 10:42 AM, Joice Joy wrote:
> >> I am trying to find info on deepspark. I read the article on databricks
> blog which doesn't mention a git repo but does say it's open source.
> >> Help me find the git repo for this. I found two and not sure which one
> is
> >> the databricks deepspark:
> >> https://github.com/deepspark/deepspark
> >> https://github.com/nearbydelta/deepspark
> >
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: groupBy and store in parquet

2016-05-05 Thread Xinh Huynh
Hi Michal,

Why is your solution so slow? Is it from the file IO caused by storing in a
temp file as JSON and then reading it back in and writing it as Parquet?
How are you getting "events" in the first place?

Do you have the original Kafka messages as an RDD[String]? Then how about:

1. Start with eventsAsRDD : RDD[String] (before converting to DF)
2. eventsAsRDD.map() --> use a RegEx to parse out the event_type of each
event
 For example, search the string for {"event_type"="[.*]"}
3. Now, filter by each event_type to create a separate RDD for each type,
and convert those to DF. You only convert to DF for events of the same
type, so you avoid the NULLs.

Xinh
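
A minimal sketch (Scala) of the RDD-level routing outlined above; the regex,
the "unknown" bucket and the output layout are assumptions, not Michal's actual
schema:

val eventTypeRe = """"event_type"\s*:\s*"([^"]+)""".r

val typed = eventsAsRDD.map { json =>
  val t = eventTypeRe.findFirstMatchIn(json).map(_.group(1)).getOrElse("unknown")
  (t, json)
}
typed.cache()

for (t <- typed.keys.distinct().collect()) {
  val sub = typed.filter(_._1 == t).values               // only events of one type
  sqlContext.read.json(sub)                              // schema inferred per type, no padded nulls
    .write.mode("append").parquet(s"$basePath/event_type=$t")
}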


On Thu, May 5, 2016 at 2:52 AM, Michal Vince  wrote:

> Hi Xinh
>
> For (1) the biggest problem is those null columns, e.g. the DF will have
> ~1000 columns, so every partition of that DF will have ~1000 columns; one of
> the partitioned columns can have 996 null columns, which is a big waste of
> space (in my case more than 80% on average).
>
> for (2) I can`t really change anything as the source belongs to the 3rd
> party
>
>
> Miso
>
> On 05/04/2016 05:21 PM, Xinh Huynh wrote:
>
> Hi Michal,
>
> For (1), would it be possible to partitionBy two columns to reduce the
> size? Something like partitionBy("event_type", "date").
>
> For (2), is there a way to separate the different event types upstream,
> like on different Kafka topics, and then process them separately?
>
> Xinh
>
> On Wed, May 4, 2016 at 7:47 AM, Michal Vince 
> wrote:
>
>> Hi guys
>>
>> I`m trying to store kafka stream with ~5k events/s as efficiently as
>> possible in parquet format to hdfs.
>>
>> I can`t make any changes to kafka (belongs to 3rd party)
>>
>>
>> Events in kafka are in json format, but the problem is there are many
>> different event types (from different subsystems with different number of
>> fields, different size etc..) so it doesn`t make any sense to store them in
>> the same file
>>
>>
>> I was trying to read data to DF and then repartition it by event_type and
>> store
>>
>> events.write.partitionBy("event_type").format("parquet").mode(org.apache.spark.sql.SaveMode.Append).save(tmpFolder)
>>
>> which is quite fast but have 2 drawbacks that I`m aware of
>>
>> 1. output folder has only one partition which can be huge
>>
>> 2. all DFs created like this share the same schema, so even dfs with few
>> fields have tons of null fields
>>
>>
>> My second try is bit naive and really really slow (you can see why in
>> code) - filter DF by event type and store them temporarily as json (to get
>> rid of null fields)
>>
>> val event_types = events.select($"event_type").distinct().collect() // get 
>> event_types in this batch
>> for (row <- event_types) {
>>   val currDF = events.filter($"event_type" === row.get(0))
>>   val tmpPath = tmpFolder + row.get(0)
>>   
>> currDF.write.format("json").mode(org.apache.spark.sql.SaveMode.Append).save(tmpPath)
>>   sqlContext.read.json(tmpPath).write.format("parquet").save(basePath)
>>
>> }
>> hdfs.delete(new Path(tmpFolder), true)
>>
>>
>> Do you have any suggestions for any better solution to this?
>>
>> thanks
>>
>>
>>
>
>


Re: Writing output of key-value Pair RDD

2016-05-05 Thread Afshartous, Nick

Thanks, I got the example below working.  Though it writes both the keys and 
values to the output file.

Is there any way to write just the values ?

--

Nick


String[] strings = { "Abcd", "Azlksd", "whhd", "wasc", "aDxa" };

sc.parallelize(Arrays.asList(strings))

.mapToPair(pairFunction)
.saveAsHadoopFile("s3://...", String.class, String.class, 
RDDMultipleTextOutputFormat.class);



From: Nicholas Chammas 
Sent: Wednesday, May 4, 2016 4:21:12 PM
To: Afshartous, Nick; user@spark.apache.org
Subject: Re: Writing output of key-value Pair RDD

You're looking for this discussion: http://stackoverflow.com/q/23995040/877069

Also, a simpler alternative with DataFrames: 
https://github.com/apache/spark/pull/8375#issuecomment-202458325

On Wed, May 4, 2016 at 4:09 PM Afshartous, Nick 
> wrote:

Hi,


Is there any way to write out to S3 the values of a key-value Pair RDD ?


I'd like each value of a pair to be written to its own file where the file name 
corresponds to the key name.


Thanks,

--

Nick


SortWithinPartitions on DataFrame

2016-05-05 Thread Darshan Singh
Hi,

I have a dataframe df1 and I partitioned it by col1,col2 and persisted it.
Then I created new dataframe df2.

val df2 = df1.sortWithinPartitions("col1","col2","col3")

df1.persist()
df2.persist()
df1.count()
df2.count()

Now I expect that any group-by statement using "col1","col2","col3" should be
much faster on df2 than on df1. I expect that there should not be any shuffle
for df2, as the data is already sorted and the sum should be done on the same
machine that holds the partition.

e.g.
df1.groupBy("col1","col2","col3").sum("col4","col5").collect()
should be shuffled, as we know that the data is not sorted.

df2.groupBy("col1","col2","col3").sum("col4","col5").collect()
shouldn't cause any shuffle, as the data within each partition is sorted the
way we need.

However, this doesn't seem to be the case in 1.6.1.


Am i missing something?

Thanks
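
A hedged side note: the quickest way to see whether a shuffle is still planned
is to inspect the physical plan rather than timing the job, for example:

df2.groupBy("col1", "col2", "col3").sum("col4", "col5").explain()

If an Exchange operator (TungstenExchange in 1.6) appears above the
aggregation, Spark is still shuffling despite the pre-sort.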


Re: H2O + Spark Streaming?

2016-05-05 Thread ndjido
Sure! Check the following working example : 
https://github.com/h2oai/qcon2015/tree/master/05-spark-streaming/ask-craig-streaming-app
 

Cheers.
Ardo

Sent from my iPhone

> On 05 May 2016, at 17:26, diplomatic Guru  wrote:
> 
> Hello all, I was wondering if it is possible to use H2O with Spark Streaming 
> for online prediction? 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Disable parquet metadata summary in

2016-05-05 Thread Bijay Kumar Pathak
Hi,

How can we disable writing _common_metadata while saving a DataFrame in
Parquet format in PySpark? I tried to set the property using the command below,
but it didn't help.

sparkContext._jsc.hadoopConfiguration().set("parquet.enable.summary-metadata",
"false")


Thanks,
Bijay


Missing data in Kafka Consumer

2016-05-05 Thread Jerry
Hi,

Does anybody give me an idea why the data is lost at the Kafka Consumer
side? I use Kafka 0.8.2 and Spark (streaming) version is 1.5.2. Sometimes, I
found out I could not receive the same number of data with Kafka producer.
Exp) I sent 1000 data to Kafka Broker via Kafka Producer and confirmed the
same number in the Broker. But when I checked either HDFS or Cassandra, the
number is just 363. The data is not always lost, just sometimes... That's
weird and annoying to me.
Can anybody give me some reasons? 

Thanks!
Jerry  



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Missing-data-in-Kafka-Consumer-tp26887.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Content-based Recommendation Engine

2016-05-05 Thread Sree Eedupuganti
Can anyone share code for a content-based recommendation engine that makes
recommendations to the user based on e-mail subject?

-- 
Best Regards,
Sreeharsha Eedupuganti
Data Engineer
innData Analytics Private Limited


Re: [Spark 1.5.2 ]-how to set and get Storage level for Dataframe

2016-05-05 Thread Divya Gehlot
But why? Any specific reason behind it?
I am aware that we can persist the dataframes, but before proceeding I would
like to know the storage level of my DFs.
I am working on performance tuning of my Spark jobs and looking for Storage
Level APIs like the ones RDDs have.




Thanks,
Divya

On 6 May 2016 at 11:16, Ted Yu  wrote:

> I am afraid there is no such API.
>
> When persisting, you can specify StorageLevel :
>
>   def persist(newLevel: StorageLevel): this.type = {
>
> Can you tell us your use case ?
>
> Thanks
>
> On Thu, May 5, 2016 at 8:06 PM, Divya Gehlot 
> wrote:
>
>> Hi,
>> How can I get and set storage level for Dataframes like RDDs ,
>> as mentioned in following  book links
>>
>> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-caching.html
>>
>>
>>
>> Thanks,
>> Divya
>>
>
>
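
A minimal sketch (Scala, Spark 1.5/1.6) of what Ted describes: the level is
chosen at persist time, and there is no public getter on the DataFrame itself;
the path below is a placeholder:

import org.apache.spark.storage.StorageLevel

val df = sqlContext.read.parquet("hdfs:///some/path")
df.persist(StorageLevel.MEMORY_AND_DISK_SER)
df.count()                                  // materialize the cache
// The "Storage" tab of the Spark UI then shows the level and fraction cached.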


unsubscribe

2016-05-05 Thread Brindha Sengottaiyan



[Spark 1.5.2 ]-how to set and get Storage level for Dataframe

2016-05-05 Thread Divya Gehlot
Hi,
How can I get and set storage level for Dataframes like RDDs ,
as mentioned in following  book links
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-caching.html



Thanks,
Divya


Re: [Spark 1.5.2 ]-how to set and get Storage level for Dataframe

2016-05-05 Thread Ted Yu
I am afraid there is no such API.

When persisting, you can specify StorageLevel :

  def persist(newLevel: StorageLevel): this.type = {

Can you tell us your use case ?

Thanks

On Thu, May 5, 2016 at 8:06 PM, Divya Gehlot 
wrote:

> Hi,
> How can I get and set storage level for Dataframes like RDDs ,
> as mentioned in following  book links
>
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-caching.html
>
>
>
> Thanks,
> Divya
>


Fw: Significant performance difference for same spark job in scala vs pyspark

2016-05-05 Thread pratik gawande
Hello,

I am new to Spark. For one of my jobs I am finding a significant performance
difference when it is run in PySpark vs Scala. Could you please let me know if this
is known and whether Scala is preferred over Python for writing Spark jobs? Also, the DAG
visualization shows completely different DAGs for Scala and PySpark. I have
pasted the DAG for both using the toDebugString() method. Let me know if you need any
additional information.

Time for Job in scala : 52 secs

Time for job in pyspark : 4.2 min


Scala code in Zepplin:

val lines = sc.textFile("s3://[test-bucket]/output2/")
val words = lines.flatMap(line => line.split(" "))
val filteredWords = words.filter(word => word.equals("Gutenberg") || 
word.equals("flower") || word.equals("a"))
val wordMap = filteredWords.map(word => (word, 1)).reduceByKey(_ + _)
wordMap.collect()

pyspark code in Zepplin:

lines = sc.textFile("s3://[test-bucket]/output2/")
words = lines.flatMap(lambda x: x.split())
filteredWords = words.filter(lambda x: (x == "Gutenberg" or x == "flower" or x 
== "a"))
result = filteredWords.map(lambda x: (x, 1)).reduceByKey(lambda a,b: 
a+b).collect()
print result


Scala final RDD:

print wordMap.toDebugString()

lines: org.apache.spark.rdd.RDD[String] = s3://[test-bucket]/output2/ MapPartitionsRDD[108] at textFile at <console>:30
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[109] at flatMap at <console>:31
filteredWords: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[110] at filter at <console>:33
wordMap: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[112] at reduceByKey at <console>:35

(10) ShuffledRDD[112] at reduceByKey at <console>:35 []
 +-(10) MapPartitionsRDD[111] at map at <console>:35 []
    |  MapPartitionsRDD[110] at filter at <console>:33 []
    |  MapPartitionsRDD[109] at flatMap at <console>:31 []
    |  s3://[test-bucket]/output2/ MapPartitionsRDD[108] at textFile at <console>:30 []
    |  s3://[test-bucket]/output2/ HadoopRDD[107] at textFile at <console>:30 []


PySpark final RDD:

println(wordMap.toDebugString)

(10) PythonRDD[119] at RDD at PythonRDD.scala:43 []
 |   s3://[test-bucket]/output2/ MapPartitionsRDD[114] at textFile at null:-1 []
 |   s3://[test-bucket]/output2/ HadoopRDD[113] at textFile at null:-1 []
PythonRDD[120] at RDD at PythonRDD.scala:43


Thanks,

Pratik


Re: Fw: Significant performance difference for same spark job in scala vs pyspark

2016-05-05 Thread Saisai Shao
Writing an RDD-based application in PySpark brings in additional overhead:
Spark runs on the JVM whereas your Python code runs in a Python runtime, so
data has to be communicated between the JVM world and the Python world, which
requires additional serialization/deserialization and IPC, among other costs.
So the performance difference is expected, but you could tune the application
to reduce the gap.

Also, because the Python RDD wraps a lot of machinery, the DAG you see is
different from Scala's; that is also expected.

Thanks
Saisai


On Fri, May 6, 2016 at 12:47 PM, pratik gawande 
wrote:

> Hello,
>
> I am new to spark. For one of  job I am finding significant performance
> difference when run in pyspark vs scala. Could you please let me know if
> this is known and scala is preferred over python for writing spark jobs?
> Also DAG visualization shows completely different DAGs for scala and
> pyspark. I have pasted DAG for both using toDebugString() method. Let me
> know if you need any additional information.
>
> *Time for Job in scala* : 52 secs
>
> *Time for job in pyspark *: 4.2 min
>
>
> *Scala code in Zepplin:*
>
> val lines = sc.textFile("s3://[test-bucket]/output2/")
> val words = lines.flatMap(line => line.split(" "))
> val filteredWords = words.filter(word => word.equals("Gutenberg") ||
> word.equals("flower") || word.equals("a"))
> val wordMap = filteredWords.map(word => (word, 1)).reduceByKey(_ + _)
> wordMap.collect()
>
> *pyspark code in Zepplin:*
>
> lines = sc.textFile("s3://[test-bucket]/output2/")
> words = lines.flatMap(lambda x: x.split())
> filteredWords = words.filter(lambda x: (x == "Gutenberg" or x == "flower"
> or x == "a"))
> result = filteredWords.map(lambda x: (x, 1)).reduceByKey(lambda a,b:
> a+b).collect()
> print result
>
> *Scala final RDD:*
>
>
> *print wordMap.toDebugString() *
>
> lines: org.apache.spark.rdd.RDD[String] = s3://[test-bucket]/output2/ MapPartitionsRDD[108] at textFile at <console>:30
> words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[109] at flatMap at <console>:31
> filteredWords: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[110] at filter at <console>:33
> wordMap: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[112] at reduceByKey at <console>:35
>
> (10) ShuffledRDD[112] at reduceByKey at <console>:35 []
>  +-(10) MapPartitionsRDD[111] at map at <console>:35 []
>     |  MapPartitionsRDD[110] at filter at <console>:33 []
>     |  MapPartitionsRDD[109] at flatMap at <console>:31 []
>     |  s3://[test-bucket]/output2/ MapPartitionsRDD[108] at textFile at <console>:30 []
>     |  s3://[test-bucket]/output2/ HadoopRDD[107] at textFile at <console>:30 []
>
>
> *PySpark final RDD:*
>
>
> *println(wordMap.toDebugString) *
>
> (10) PythonRDD[119] at RDD at PythonRDD.scala:43 []
>  |   s3://[test-bucket]/output2/ MapPartitionsRDD[114] at textFile at null:-1 []
>  |   s3://[test-bucket]/output2/ HadoopRDD[113] at textFile at null:-1 []
> PythonRDD[120] at RDD at PythonRDD.scala:43
>
>
> Thanks,
>
> Pratik
>


H2O + Spark Streaming?

2016-05-05 Thread diplomatic Guru
Hello all, I was wondering if it is possible to use H2O with Spark
Streaming for online prediction?


Could we use Sparkling Water Lib with Spark Streaming

2016-05-05 Thread diplomatic Guru
Hello all, I was wondering if it is possible to use H2O with Spark
Streaming for online prediction?


Re: DeepSpark: where to start

2016-05-05 Thread Jason Nerothin
Just so that there is no confusion, there is a Spark user interface project
called DeepSense that is actually useful: http://deepsense.io

I am not affiliated with them in any way...

On Thu, May 5, 2016 at 9:42 AM, Joice Joy  wrote:

> What the heck, I was already beginning to like it.
>
> On Thu, May 5, 2016 at 12:31 PM, Mark Vervuurt 
> wrote:
>
>> Well, you got me fooled as well ;)
>> Had it on my todolist to dive into this new component...
>>
>> Mark
>>
>> > Op 5 mei 2016 om 07:06 heeft Derek Chan  het
>> volgende geschreven:
>> >
>> > The blog post is a April Fool's joke. Read the last line in the post:
>> >
>> >
>> https://databricks.com/blog/2016/04/01/unreasonable-effectiveness-of-deep-learning-on-spark.html
>> >
>> >
>> >
>> >> On Thursday, May 05, 2016 10:42 AM, Joice Joy wrote:
>> >> I am trying to find info on deepspark. I read the article on
>> databricks blog which doesn't mention a git repo but does say it's open
>> source.
>> >> Help me find the git repo for this. I found two and not sure which one
>> is
>> >> the databricks deepspark:
>> >> https://github.com/deepspark/deepspark
>> >> https://github.com/nearbydelta/deepspark
>> >
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Individual DStream Checkpointing in Spark Streaming

2016-05-05 Thread Akash Mishra
Hi *,

I am a little confused about checkpointing the Spark Streaming Context versus
checkpointing an individual DStream.

E.g:

JavaStreamingContext jssc = new JavaStreamingContext(conf,
Durations.seconds(1));

jssc.checkpoint("hdfs://...");


Will start checkpointing the DStream operations, configuration & incomplete-batch
info to HDFS. As I understand it, this will not checkpoint any DStream RDD into
HDFS.

We also have the option of individually checkpointing any DStream in the
streaming context. When we start checkpointing an individual DStream, all the
RDDs associated with that DStream will be checkpointed into HDFS.

JavaDStream transformedWindow = udsEventStream
    .window(windowDuration, aggDuration)
    .transform(transformer);
transformedWindow.checkpoint(aggDuration);



Questions:

1. What are the benefits of individually checkpointing each stream?
2. When the stream's input source is HDFS, does checkpointing an individual
stream provide any benefits?
3. How does Spark use individual stream checkpoints to become fault
tolerant?
4. According to Spark Documentation,
" For stateful transformations that require RDD checkpointing, the default
interval is a multiple of the batch interval that is at least 10 seconds.
It can be set by using dstream.checkpoint(checkpointInterval). Typically, a
checkpoint interval of 5 - 10 sliding intervals of a DStream is a good
setting to try."

Are all stateful transformations already checkpointed?



Thanks,

-- 

Regards,
Akash Mishra.


"It's not our abilities that make us, but our decisions."--Albus Dumbledore