Re: Does Data pipeline using kafka and structured streaming work?

2016-11-01 Thread Cody Koeninger
One thing you should be aware of (that's a showstopper for my use
cases, but may not be for yours) is that you can provide Kafka offsets
to start from, but you can't really get access to offsets and metadata
during the job on a per-batch or per-partition basis, just on a
per-message basis.
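For illustration, a minimal sketch (not from this thread; it assumes the spark-sql-kafka-0-10 source for Structured Streaming and Spark 2.x APIs, with placeholder broker and topic names) of how starting offsets can be supplied up front, while offsets and metadata only surface as per-message columns:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-offsets-sketch")
  .master("local[*]")
  .getOrCreate()

// Starting offsets can be provided when the stream is defined...
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder broker
  .option("subscribe", "events")                        // placeholder topic
  .option("startingOffsets", "earliest")                // or a per-partition spec, depending on connector version
  .load()

// ...but at runtime, offsets and metadata are only visible per message, as columns:
val withOffsets = stream.selectExpr(
  "topic", "partition", "offset", "CAST(value AS STRING) AS value")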

On Tue, Nov 1, 2016 at 8:29 PM, Michael Armbrust  wrote:
> Yeah, those are all requests for additional features / version support.
> I've been using kafka with structured streaming to do both ETL into
> partitioned parquet tables as well as streaming event time windowed
> aggregation for several weeks now.
>
> On Tue, Nov 1, 2016 at 6:18 PM, Cody Koeninger  wrote:
>>
>> Look at the resolved subtasks attached to that ticket you linked.
>> Some of them are unresolved, but basic functionality is there.
>>
>> On Tue, Nov 1, 2016 at 7:37 PM, shyla deshpande
>>  wrote:
>> > Hi Michael,
>> >
>> > Thanks for the reply.
>> >
>> > The following link says there is a open unresolved Jira for Structured
>> > streaming support for consuming from Kafka.
>> >
>> > https://issues.apache.org/jira/browse/SPARK-15406
>> >
>> > Appreciate your help.
>> >
>> > -Shyla
>> >
>> >
>> > On Tue, Nov 1, 2016 at 5:19 PM, Michael Armbrust
>> > 
>> > wrote:
>> >>
>> >> I'm not aware of any open issues against the kafka source for
>> >> structured
>> >> streaming.
>> >>
>> >> On Tue, Nov 1, 2016 at 4:45 PM, shyla deshpande
>> >> 
>> >> wrote:
>> >>>
>> >>> I am building a data pipeline using Kafka, Spark streaming and
>> >>> Cassandra.
>> >>> Wondering if the issues with  Kafka source fixed in Spark 2.0.1. If
>> >>> not,
>> >>> please give me an update on when it may be fixed.
>> >>>
>> >>> Thanks
>> >>> -Shyla
>> >>
>> >>
>> >
>
>




Re: Does Data pipeline using kafka and structured streaming work?

2016-11-01 Thread Michael Armbrust
Yeah, those are all requests for additional features / version support.
I've been using Kafka with Structured Streaming to do both ETL into
partitioned Parquet tables and streaming event-time windowed
aggregation for several weeks now.
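A rough sketch of those two patterns (assuming the Structured Streaming Kafka source and Spark 2.x APIs; the topic, paths and column names below are placeholders, not code from this thread):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder().appName("kafka-ss-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder broker
  .option("subscribe", "events")                        // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING) AS value", "timestamp")

// 1) ETL into partitioned Parquet tables
val etl = events
  .withColumn("date", $"timestamp".cast("date"))
  .writeStream
  .format("parquet")
  .option("path", "/data/events")                       // placeholder path
  .option("checkpointLocation", "/checkpoints/etl")     // placeholder path
  .partitionBy("date")
  .start()

// 2) Streaming event-time windowed aggregation
val counts = events
  .groupBy(window($"timestamp", "10 minutes"))
  .count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()

spark.streams.awaitAnyTermination()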

On Tue, Nov 1, 2016 at 6:18 PM, Cody Koeninger  wrote:

> Look at the resolved subtasks attached to that ticket you linked.
> Some of them are unresolved, but basic functionality is there.
>
> On Tue, Nov 1, 2016 at 7:37 PM, shyla deshpande
>  wrote:
> > Hi Michael,
> >
> > Thanks for the reply.
> >
> > The following link says there is a open unresolved Jira for Structured
> > streaming support for consuming from Kafka.
> >
> > https://issues.apache.org/jira/browse/SPARK-15406
> >
> > Appreciate your help.
> >
> > -Shyla
> >
> >
> > On Tue, Nov 1, 2016 at 5:19 PM, Michael Armbrust  >
> > wrote:
> >>
> >> I'm not aware of any open issues against the kafka source for structured
> >> streaming.
> >>
> >> On Tue, Nov 1, 2016 at 4:45 PM, shyla deshpande <
> deshpandesh...@gmail.com>
> >> wrote:
> >>>
> >>> I am building a data pipeline using Kafka, Spark streaming and
> Cassandra.
> >>> Wondering if the issues with  Kafka source fixed in Spark 2.0.1. If
> not,
> >>> please give me an update on when it may be fixed.
> >>>
> >>> Thanks
> >>> -Shyla
> >>
> >>
> >
>


Re: Does Data pipeline using kafka and structured streaming work?

2016-11-01 Thread Cody Koeninger
Look at the subtasks attached to that ticket you linked.
Some of them are unresolved, but the basic functionality is there.

On Tue, Nov 1, 2016 at 7:37 PM, shyla deshpande
 wrote:
> Hi Michael,
>
> Thanks for the reply.
>
> The following link says there is a open unresolved Jira for Structured
> streaming support for consuming from Kafka.
>
> https://issues.apache.org/jira/browse/SPARK-15406
>
> Appreciate your help.
>
> -Shyla
>
>
> On Tue, Nov 1, 2016 at 5:19 PM, Michael Armbrust 
> wrote:
>>
>> I'm not aware of any open issues against the kafka source for structured
>> streaming.
>>
>> On Tue, Nov 1, 2016 at 4:45 PM, shyla deshpande 
>> wrote:
>>>
>>> I am building a data pipeline using Kafka, Spark streaming and Cassandra.
>>> Wondering if the issues with  Kafka source fixed in Spark 2.0.1. If not,
>>> please give me an update on when it may be fixed.
>>>
>>> Thanks
>>> -Shyla
>>
>>
>




Re: Does Data pipeline using kafka and structured streaming work?

2016-11-01 Thread shyla deshpande
Hi Michael,

Thanks for the reply.

The following link says there is an open, unresolved JIRA for Structured
Streaming support for consuming from Kafka.

https://issues.apache.org/jira/browse/SPARK-15406

Appreciate your help.

-Shyla


On Tue, Nov 1, 2016 at 5:19 PM, Michael Armbrust 
wrote:

> I'm not aware of any open issues against the kafka source for structured
> streaming.
>
> On Tue, Nov 1, 2016 at 4:45 PM, shyla deshpande 
> wrote:
>
>> I am building a data pipeline using Kafka, Spark streaming and Cassandra.
>> Wondering if the issues with  Kafka source fixed in Spark 2.0.1. If not,
>> please give me an update on when it may be fixed.
>>
>> Thanks
>> -Shyla
>>
>
>


Re: Does Data pipeline using kafka and structured streaming work?

2016-11-01 Thread Michael Armbrust
I'm not aware of any open issues against the kafka source for structured
streaming.

On Tue, Nov 1, 2016 at 4:45 PM, shyla deshpande 
wrote:

> I am building a data pipeline using Kafka, Spark streaming and Cassandra.
> Wondering if the issues with  Kafka source fixed in Spark 2.0.1. If not,
> please give me an update on when it may be fixed.
>
> Thanks
> -Shyla
>


Does Data pipeline using kafka and structured streaming work?

2016-11-01 Thread shyla deshpande
I am building a data pipeline using Kafka, Spark Streaming and Cassandra.
I am wondering whether the issues with the Kafka source are fixed in Spark 2.0.1.
If not, please give me an update on when they may be fixed.

Thanks
-Shyla


not able to connect to table using hiveContext

2016-11-01 Thread vinay parekar

Hi there,
   I am trying to get some table data using the Spark hiveContext. I am getting
the following exception:


org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table rnow_imports_text. null
   at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1158)
   at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:302)
   at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:298)
   at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
   at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
   at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
   at org.apache.spark.sql.hive.client.ClientWrapper.getTableOption(ClientWrapper.scala:298)
   at org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:123)
   at org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:60)
   at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:406)
   at org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:422)
   at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
   at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:203)
   at org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:422)
   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:257)
   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:268)
   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:264)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:54)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:54)
   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:54)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:54)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:54)
   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
Cool!

So going back to the IDF Estimator and Model problem, do you know what an IDF
estimator really does during the Fitting process? It must be storing some state
(information), as I mentioned in the OP (|D|, DF|t, D| and perhaps TF|t, D|),
that it re-uses to Transform test data (labeled data). Or does it just
maintain a map (lookup) of token -> IDF score and use that to look up
scores for the test-data tokens?

Here's one possible thought in the context of Naive Bayes:
Fitting the IDF model (idf1) generates the conditional probability of a
token (feature). E.g., let's say the IDF of the term "software" is 4.5, so it
stores a lookup software -> 4.5.
Transforming the training data using idf1 basically just creates a dataframe
with the above conditional probability vectors for each document.
Transforming the test data using the same idf1 uses the lookup generated above
to create conditional probability vectors for each document. E.g., if it
encounters "software" in the test data, its IDF value would just be 4.5.

Thanks
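For illustration, a small spark.ml sketch of the second interpretation: fit() scans the corpus once and the resulting IDFModel keeps one IDF weight per feature index (exposed as idfModel.idf), while transform() only re-weights each document's TF vector by those stored weights. The toy data and column names are placeholders, not code from this thread:

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("idf-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val train = Seq((0.0, "spark kafka streaming"), (1.0, "software design software")).toDF("label", "text")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("tokens")
val tf = new HashingTF().setInputCol("tokens").setOutputCol("rawFeatures").setNumFeatures(1 << 10)
val tfTrain = tf.transform(tokenizer.transform(train))

// fit() scans the training corpus once and stores one IDF weight per feature index
val idfModel = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(tfTrain)
println(idfModel.idf)   // the learned per-term weights, i.e. the stored "lookup" state

// transform() only multiplies each document's TF vector by the stored weights
val test = Seq((0.0, "software streaming")).toDF("label", "text")
val tfTest = tf.transform(tokenizer.transform(test))
idfModel.transform(tfTest).select("features").show(false)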




On Tue, Nov 1, 2016 at 4:09 PM, ayan guha  wrote:

> Yes, that is correct. I think I misread a part of it in terms of
> scoringI think we both are saying same thing so thats a good thing :)
>
> On Wed, Nov 2, 2016 at 10:04 AM, Nirav Patel 
> wrote:
>
>> Hi Ayan,
>>
>> "classification algorithm will for sure need to Fit against new dataset
>> to produce new model" I said this in context of re-training the model. Is
>> it not correct? Isn't it part of re-training?
>>
>> Thanks
>>
>> On Tue, Nov 1, 2016 at 4:01 PM, ayan guha  wrote:
>>
>>> Hi
>>>
>>> "classification algorithm will for sure need to Fit against new dataset
>>> to produce new model" - I do not think this is correct. Maybe we are
>>> talking semantics but AFAIU, you "train" one model using some dataset, and
>>> then use it for scoring new datasets.
>>>
>>> You may re-train every month, yes. And you may run cross validation once
>>> a month (after re-training) or lower freq like once in 2-3 months to
>>> validate model quality. Here, number of months are not important, but you
>>> must be running cross validation and similar sort of "model evaluation"
>>> work flow typically in lower frequency than Re-Training process.
>>>
>>> On Wed, Nov 2, 2016 at 5:48 AM, Nirav Patel 
>>> wrote:
>>>
 Hi Ayan,
 After deployment, we might re-train it every month. That is whole
 different problem I have explored yet. classification algorithm will for
 sure need to Fit against new dataset to produce new model. Correct me if I
 am wrong but I think I will also FIt new IDF model based on new dataset. At
 that time as well I will follow same training-validation split (or
 corss-validation) to evaluate model performance on new data before
 releasing it to make prediction. So afik , every time you  need to re-train
 model you will need to corss validate using some data split strategy.

 I think spark ML document should start explaining mathematical model or
 simple algorithm what Fit and Transform means for particular algorithm
 (IDF, NaiveBayes)

 Thanks

 On Tue, Nov 1, 2016 at 5:45 AM, ayan guha  wrote:

> I have come across similar situation recently and decided to run
> Training  workflow less frequently than scoring workflow.
>
> In your use case I would imagine you will run IDF fit workflow once in
> say a week. It will produce a model object which will be saved. In scoring
> workflow, you will typically see new unseen dataset and the model 
> generated
> in training flow will be used to score or label this new dataset.
>
> Note, train and test datasets are used during development phase when
> you are trying to find out which model to use and
> efficientcy/performance/accuracy etc. It will never be part of
> workflow. In a little elaborate setting you may want to automate model
> evaluations, but that's a different story.
>
> Not sure if I could explain properly, please feel free to comment.
> On 1 Nov 2016 22:54, "Nirav Patel"  wrote:
>
>> Yes, I do apply NaiveBayes after IDF .
>>
>> " you can re-train (fit) on all your data before applying it to
>> unseen data." Did you mean I can reuse that model to Transform both
>> training and test data?
>>
>> Here's the process:
>>
>> Datasets:
>>
>>1. Full sample data (labeled)
>>2. Training (labeled)
>>3. Test (labeled)
>>4. Unseen (non-labeled)
>>
>> Here are two workflow options I see:
>>
>> Option - 1 (currently using)
>>
>>1. Fit IDF model (idf-1) on full Sample data
>>2. Apply(Transform) idf-1 on full sample data
>>3. Split data set into Training and Test data
>>4. Fit ML model on Training data
>>5. Apply(Transform) model on Test data
>>6. 

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread ayan guha
Yes, that is correct. I think I misread a part of it in terms of
scoring... I think we are both saying the same thing, so that's a good thing :)

On Wed, Nov 2, 2016 at 10:04 AM, Nirav Patel  wrote:

> Hi Ayan,
>
> "classification algorithm will for sure need to Fit against new dataset
> to produce new model" I said this in context of re-training the model. Is
> it not correct? Isn't it part of re-training?
>
> Thanks
>
> On Tue, Nov 1, 2016 at 4:01 PM, ayan guha  wrote:
>
>> Hi
>>
>> "classification algorithm will for sure need to Fit against new dataset
>> to produce new model" - I do not think this is correct. Maybe we are
>> talking semantics but AFAIU, you "train" one model using some dataset, and
>> then use it for scoring new datasets.
>>
>> You may re-train every month, yes. And you may run cross validation once
>> a month (after re-training) or lower freq like once in 2-3 months to
>> validate model quality. Here, number of months are not important, but you
>> must be running cross validation and similar sort of "model evaluation"
>> work flow typically in lower frequency than Re-Training process.
>>
>> On Wed, Nov 2, 2016 at 5:48 AM, Nirav Patel 
>> wrote:
>>
>>> Hi Ayan,
>>> After deployment, we might re-train it every month. That is whole
>>> different problem I have explored yet. classification algorithm will for
>>> sure need to Fit against new dataset to produce new model. Correct me if I
>>> am wrong but I think I will also FIt new IDF model based on new dataset. At
>>> that time as well I will follow same training-validation split (or
>>> corss-validation) to evaluate model performance on new data before
>>> releasing it to make prediction. So afik , every time you  need to re-train
>>> model you will need to corss validate using some data split strategy.
>>>
>>> I think spark ML document should start explaining mathematical model or
>>> simple algorithm what Fit and Transform means for particular algorithm
>>> (IDF, NaiveBayes)
>>>
>>> Thanks
>>>
>>> On Tue, Nov 1, 2016 at 5:45 AM, ayan guha  wrote:
>>>
 I have come across similar situation recently and decided to run
 Training  workflow less frequently than scoring workflow.

 In your use case I would imagine you will run IDF fit workflow once in
 say a week. It will produce a model object which will be saved. In scoring
 workflow, you will typically see new unseen dataset and the model generated
 in training flow will be used to score or label this new dataset.

 Note, train and test datasets are used during development phase when
 you are trying to find out which model to use and
 efficientcy/performance/accuracy etc. It will never be part of
 workflow. In a little elaborate setting you may want to automate model
 evaluations, but that's a different story.

 Not sure if I could explain properly, please feel free to comment.
 On 1 Nov 2016 22:54, "Nirav Patel"  wrote:

> Yes, I do apply NaiveBayes after IDF .
>
> " you can re-train (fit) on all your data before applying it to
> unseen data." Did you mean I can reuse that model to Transform both
> training and test data?
>
> Here's the process:
>
> Datasets:
>
>1. Full sample data (labeled)
>2. Training (labeled)
>3. Test (labeled)
>4. Unseen (non-labeled)
>
> Here are two workflow options I see:
>
> Option - 1 (currently using)
>
>1. Fit IDF model (idf-1) on full Sample data
>2. Apply(Transform) idf-1 on full sample data
>3. Split data set into Training and Test data
>4. Fit ML model on Training data
>5. Apply(Transform) model on Test data
>6. Apply(Transform) idf-1 on Unseen data
>7. Apply(Transform) model on Unseen data
>
> Option - 2
>
>1. Split sample data into Training and Test data
>2. Fit IDF model (idf-1) only on training data
>3. Apply(Transform) idf-1 on training data
>4. Apply(Transform) idf-1 on test data
>5. Fit ML model on Training data
>6. Apply(Transform) model on Test data
>7. Apply(Transform) idf-1 on Unseen data
>8. Apply(Transform) model on Unseen data
>
> So you are suggesting Option-2 in this particular case, right?
>
> On Tue, Nov 1, 2016 at 4:24 AM, Robin East 
> wrote:
>
>> Fit it on training data to evaluate the model. You can either use
>> that model to apply to unseen data or you can re-train (fit) on all your
>> data before applying it to unseen data.
>>
>> fit and transform are 2 different things: fit creates a model,
>> transform applies a model to data to create transformed output. If you 
>> are
>> using your training data in a subsequent step (e.g. running logistic
>> regression 

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
Hi Ayan,

"classification algorithm will for sure need to Fit against new dataset to
produce new model" I said this in the context of re-training the model. Is it
not correct? Isn't it part of re-training?

Thanks

On Tue, Nov 1, 2016 at 4:01 PM, ayan guha  wrote:

> Hi
>
> "classification algorithm will for sure need to Fit against new dataset
> to produce new model" - I do not think this is correct. Maybe we are
> talking semantics but AFAIU, you "train" one model using some dataset, and
> then use it for scoring new datasets.
>
> You may re-train every month, yes. And you may run cross validation once a
> month (after re-training) or lower freq like once in 2-3 months to validate
> model quality. Here, number of months are not important, but you must be
> running cross validation and similar sort of "model evaluation" work flow
> typically in lower frequency than Re-Training process.
>
> On Wed, Nov 2, 2016 at 5:48 AM, Nirav Patel  wrote:
>
>> Hi Ayan,
>> After deployment, we might re-train it every month. That is whole
>> different problem I have explored yet. classification algorithm will for
>> sure need to Fit against new dataset to produce new model. Correct me if I
>> am wrong but I think I will also FIt new IDF model based on new dataset. At
>> that time as well I will follow same training-validation split (or
>> corss-validation) to evaluate model performance on new data before
>> releasing it to make prediction. So afik , every time you  need to re-train
>> model you will need to corss validate using some data split strategy.
>>
>> I think spark ML document should start explaining mathematical model or
>> simple algorithm what Fit and Transform means for particular algorithm
>> (IDF, NaiveBayes)
>>
>> Thanks
>>
>> On Tue, Nov 1, 2016 at 5:45 AM, ayan guha  wrote:
>>
>>> I have come across similar situation recently and decided to run
>>> Training  workflow less frequently than scoring workflow.
>>>
>>> In your use case I would imagine you will run IDF fit workflow once in
>>> say a week. It will produce a model object which will be saved. In scoring
>>> workflow, you will typically see new unseen dataset and the model generated
>>> in training flow will be used to score or label this new dataset.
>>>
>>> Note, train and test datasets are used during development phase when you
>>> are trying to find out which model to use and 
>>> efficientcy/performance/accuracy
>>> etc. It will never be part of workflow. In a little elaborate setting you
>>> may want to automate model evaluations, but that's a different story.
>>>
>>> Not sure if I could explain properly, please feel free to comment.
>>> On 1 Nov 2016 22:54, "Nirav Patel"  wrote:
>>>
 Yes, I do apply NaiveBayes after IDF .

 " you can re-train (fit) on all your data before applying it to unseen
 data." Did you mean I can reuse that model to Transform both training and
 test data?

 Here's the process:

 Datasets:

1. Full sample data (labeled)
2. Training (labeled)
3. Test (labeled)
4. Unseen (non-labeled)

 Here are two workflow options I see:

 Option - 1 (currently using)

1. Fit IDF model (idf-1) on full Sample data
2. Apply(Transform) idf-1 on full sample data
3. Split data set into Training and Test data
4. Fit ML model on Training data
5. Apply(Transform) model on Test data
6. Apply(Transform) idf-1 on Unseen data
7. Apply(Transform) model on Unseen data

 Option - 2

1. Split sample data into Training and Test data
2. Fit IDF model (idf-1) only on training data
3. Apply(Transform) idf-1 on training data
4. Apply(Transform) idf-1 on test data
5. Fit ML model on Training data
6. Apply(Transform) model on Test data
7. Apply(Transform) idf-1 on Unseen data
8. Apply(Transform) model on Unseen data

 So you are suggesting Option-2 in this particular case, right?

 On Tue, Nov 1, 2016 at 4:24 AM, Robin East 
 wrote:

> Fit it on training data to evaluate the model. You can either use that
> model to apply to unseen data or you can re-train (fit) on all your data
> before applying it to unseen data.
>
> fit and transform are 2 different things: fit creates a model,
> transform applies a model to data to create transformed output. If you are
> using your training data in a subsequent step (e.g. running logistic
> regression or some other machine learning algorithm) then you need to
> transform your training data using the IDF model before passing it through
> the next step.
>
> 
> ---
> Robin East
> *Spark GraphX in Action* Michael Malak and Robin East
> 

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread ayan guha
Hi

"classification algorithm will for sure need to Fit against new dataset to
produce new model" - I do not think this is correct. Maybe we are talking
semantics, but AFAIU you "train" one model using some dataset, and then use
it for scoring new datasets.

You may re-train every month, yes. And you may run cross-validation once a
month (after re-training), or at a lower frequency like once every 2-3 months,
to validate model quality. Here the number of months is not important, but you
should be running cross-validation and similar sorts of "model evaluation"
workflows at a lower frequency than the re-training process.
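A minimal sketch of that split between a less frequent training workflow and a frequent scoring workflow, using spark.ml model persistence (paths, column names and toy data are placeholders, not code from this thread):

import org.apache.spark.ml.feature.{HashingTF, IDF, IDFModel, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("train-vs-score").master("local[*]").getOrCreate()
import spark.implicits._

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("tokens")
val tf = new HashingTF().setInputCol("tokens").setOutputCol("rawFeatures")

// Training workflow (run infrequently): fit on the training corpus and persist the model
val train = Seq("spark kafka streaming", "software design software").toDF("text")
val idfModel = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  .fit(tf.transform(tokenizer.transform(train)))
idfModel.write.overwrite().save("/tmp/models/idf")          // placeholder path

// Scoring workflow (run frequently): load the saved model and apply it to unseen data
val unseen = Seq("unseen software text").toDF("text")
val loaded = IDFModel.load("/tmp/models/idf")
loaded.transform(tf.transform(tokenizer.transform(unseen))).show(false)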

On Wed, Nov 2, 2016 at 5:48 AM, Nirav Patel  wrote:

> Hi Ayan,
> After deployment, we might re-train it every month. That is whole
> different problem I have explored yet. classification algorithm will for
> sure need to Fit against new dataset to produce new model. Correct me if I
> am wrong but I think I will also FIt new IDF model based on new dataset. At
> that time as well I will follow same training-validation split (or
> corss-validation) to evaluate model performance on new data before
> releasing it to make prediction. So afik , every time you  need to re-train
> model you will need to corss validate using some data split strategy.
>
> I think spark ML document should start explaining mathematical model or
> simple algorithm what Fit and Transform means for particular algorithm
> (IDF, NaiveBayes)
>
> Thanks
>
> On Tue, Nov 1, 2016 at 5:45 AM, ayan guha  wrote:
>
>> I have come across similar situation recently and decided to run
>> Training  workflow less frequently than scoring workflow.
>>
>> In your use case I would imagine you will run IDF fit workflow once in
>> say a week. It will produce a model object which will be saved. In scoring
>> workflow, you will typically see new unseen dataset and the model generated
>> in training flow will be used to score or label this new dataset.
>>
>> Note, train and test datasets are used during development phase when you
>> are trying to find out which model to use and 
>> efficientcy/performance/accuracy
>> etc. It will never be part of workflow. In a little elaborate setting you
>> may want to automate model evaluations, but that's a different story.
>>
>> Not sure if I could explain properly, please feel free to comment.
>> On 1 Nov 2016 22:54, "Nirav Patel"  wrote:
>>
>>> Yes, I do apply NaiveBayes after IDF .
>>>
>>> " you can re-train (fit) on all your data before applying it to unseen
>>> data." Did you mean I can reuse that model to Transform both training and
>>> test data?
>>>
>>> Here's the process:
>>>
>>> Datasets:
>>>
>>>1. Full sample data (labeled)
>>>2. Training (labeled)
>>>3. Test (labeled)
>>>4. Unseen (non-labeled)
>>>
>>> Here are two workflow options I see:
>>>
>>> Option - 1 (currently using)
>>>
>>>1. Fit IDF model (idf-1) on full Sample data
>>>2. Apply(Transform) idf-1 on full sample data
>>>3. Split data set into Training and Test data
>>>4. Fit ML model on Training data
>>>5. Apply(Transform) model on Test data
>>>6. Apply(Transform) idf-1 on Unseen data
>>>7. Apply(Transform) model on Unseen data
>>>
>>> Option - 2
>>>
>>>1. Split sample data into Training and Test data
>>>2. Fit IDF model (idf-1) only on training data
>>>3. Apply(Transform) idf-1 on training data
>>>4. Apply(Transform) idf-1 on test data
>>>5. Fit ML model on Training data
>>>6. Apply(Transform) model on Test data
>>>7. Apply(Transform) idf-1 on Unseen data
>>>8. Apply(Transform) model on Unseen data
>>>
>>> So you are suggesting Option-2 in this particular case, right?
>>>
>>> On Tue, Nov 1, 2016 at 4:24 AM, Robin East 
>>> wrote:
>>>
 Fit it on training data to evaluate the model. You can either use that
 model to apply to unseen data or you can re-train (fit) on all your data
 before applying it to unseen data.

 fit and transform are 2 different things: fit creates a model,
 transform applies a model to data to create transformed output. If you are
 using your training data in a subsequent step (e.g. running logistic
 regression or some other machine learning algorithm) then you need to
 transform your training data using the IDF model before passing it through
 the next step.

 
 ---
 Robin East
 *Spark GraphX in Action* Michael Malak and Robin East
 Manning Publications Co.
 http://www.manning.com/books/spark-graphx-in-action





 On 1 Nov 2016, at 11:18, Nirav Patel  wrote:

 Just to re-iterate what you said, I should fit IDF model only on
 training data and then re-use it for both test data and then later on
 unseen data to make predictions.

 On Tue, Nov 1, 2016 at 3:49 

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Michael Armbrust
registerTempTable is backed by an in-memory hash table that maps table name
(a string) to a logical query plan.  Fragments of that logical query plan
may or may not be cached (but calling register alone will not result in any
materialization of results).  In Spark 2.0 we renamed this function to
createOrReplaceTempView, since a traditional RDBMS view is a better analogy
here.

If I was trying to augment the engine to make better use of HBase's
internal ordering, I'd probably use the experimental ability to inject
extra strategies into the query planner.  Essentially, you could look for
filters on top of BaseRelations (the internal class used to map DataSources
into the query plan) where there is a range filter on some prefix of the
table's key.  When this is detected, you could return an RDD that contains
the already filtered result talking directly to HBase, which would override
the default execution pathway.

I wrote up a (toy) example of using this API, which might be helpful.
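For readers following along, a bare-bones sketch of that extension point (Spark 2.x; the strategy below is a hypothetical placeholder that merely matches filters and defers to the default planner by returning Nil, whereas a real HBase pushdown would return a physical plan wrapping an RDD that scans only the matching key range):

import org.apache.spark.sql.{SparkSession, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.execution.SparkPlan

// Hypothetical strategy: gets a chance to plan every logical operator.
object KeyRangePushdown extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case Filter(condition, _) =>
      // A real implementation would check whether `condition` is a range
      // predicate on a prefix of the table key and, if so, return a custom
      // SparkPlan that talks directly to HBase. Returning Nil falls back to
      // the default execution pathway.
      Nil
    case _ => Nil
  }
}

val spark = SparkSession.builder().appName("strategy-sketch").master("local[*]").getOrCreate()
spark.experimental.extraStrategies = KeyRangePushdown +: spark.experimental.extraStrategies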

On Tue, Nov 1, 2016 at 4:11 AM, Mich Talebzadeh 
wrote:

> it would be great if we establish this.
>
> I know in Hive these temporary tables "CREATE TEMPRARY TABLE ..." are
> private to the session and are put in a hidden staging directory as below
>
> /user/hive/warehouse/.hive-staging_hive_2016-07-10_22-58-
> 47_319_5605745346163312826-10
>
> and removed when the session ends or table is dropped
>
> Not sure how Spark handles this.
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 1 November 2016 at 10:50, Michael David Pedersen  googlemail.com> wrote:
>
>> Thanks for the link, I hadn't come across this.
>>
>>> According to https://forums.databricks.com/questions/400/what-is-the-difference-between-registertemptable-a.html
>>>
>>> and I quote
>>>
>>> "registerTempTable()
>>>
>>> registerTempTable() creates an in-memory table that is scoped to the
>>> cluster in which it was created. The data is stored using Hive's
>>> highly-optimized, in-memory columnar format."
>>>
>> But then the last post in the thread corrects this, saying:
>> "registerTempTable does not create a 'cached' in-memory table, but rather
>> an alias or a reference to the DataFrame. It's akin to a pointer in C/C++
>> or a reference in Java".
>>
>> So - probably need to dig into the sources to get more clarity on this.
>>
>> Cheers,
>> Michael
>>
>
>


Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread kant kodali
Looks like upgrading to Spark 2.0.1 fixed it! The thread count when I
do cat /proc/<pid>/status is now about 84, as opposed to about 1000 within a
span of 2 minutes on Spark 2.0.0.

On Tue, Nov 1, 2016 at 11:40 AM, Shixiong(Ryan) Zhu  wrote:

> Yes, try 2.0.1!
>
> On Tue, Nov 1, 2016 at 11:25 AM, kant kodali  wrote:
>
>> AH!!! Got it! Should I use 2.0.1 then ? I don't see 2.1.0
>>
>> On Tue, Nov 1, 2016 at 10:14 AM, Shixiong(Ryan) Zhu <
>> shixi...@databricks.com> wrote:
>>
>>> Dstream "Window" uses "union" to combine multiple RDDs in one window
>>> into a single RDD.
>>>
>>> On Tue, Nov 1, 2016 at 2:59 AM kant kodali  wrote:
>>>
 @Sean It looks like this problem can happen with other RDD's as well.
 Not just unionRDD

 On Tue, Nov 1, 2016 at 2:52 AM, kant kodali  wrote:

 Hi Sean,

 The comments seem very relevant although I am not sure if this pull
 request https://github.com/apache/spark/pull/14985 would fix my issue?
 I am not sure what unionRDD.scala has anything to do with my error (I don't
 know much about spark code base). Do I ever use unionRDD.scala when I call
 mapToPair or ReduceByKey or forEachRDD?  This error is very easy to
 reproduce you actually don't need to ingest any data to spark streaming
 job. Just have one simple transformation consists of mapToPair, reduceByKey
 and forEachRDD and have the window interval of 1min and batch interval of
 one one second and simple call ssc.awaitTermination() and watch the
 Thread Count go up significantly.

 I do think that using a fixed size executor service would probably be a
 safer approach. One could leverage ForJoinPool if they think they could
 benefit a lot from the work-steal algorithm and doubly ended queues in the
 ForkJoinPool.

 Thanks!




 On Tue, Nov 1, 2016 at 2:19 AM, Sean Owen  wrote:

 Possibly https://issues.apache.org/jira/browse/SPARK-17396 ?

 On Tue, Nov 1, 2016 at 2:11 AM kant kodali  wrote:

 Hi Ryan,

 I think you are right. This may not be related to the Receiver. I have
 attached jstack dump here. I do a simple MapToPair and reduceByKey and  I
 have a window Interval of 1 minute (6ms) and batch interval of 1s (
 1000) This is generating lot of threads atleast 5 to 8 threads per
 second and the total number of threads is monotonically increasing. So just
 for tweaking purpose I changed my window interval to 1min (6ms) and
 batch interval of 10s (1) this looked lot better but still not
 ideal at very least it is not monotonic anymore (It goes up and down). Now
 my question  really is how do I tune such that my number of threads are
 optimal while satisfying the window Interval of 1 minute (6ms) and
 batch interval of 1s (1000) ?

 This jstack dump is taken after running my spark driver program for 2
 mins and there are about 1000 threads.

 Thanks!




>>
>


Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
Hi Ayan,
After deployment, we might re-train it every month. That is a whole different
problem I haven't explored yet. The classification algorithm will for sure need
to Fit against the new dataset to produce a new model. Correct me if I am wrong,
but I think I will also Fit a new IDF model based on the new dataset. At that
time as well, I will follow the same training-validation split (or
cross-validation) to evaluate model performance on the new data before
releasing it to make predictions. So AFAIK, every time you need to re-train
the model you will need to cross-validate using some data split strategy.

I think the Spark ML documentation should explain, via a mathematical model or
a simple algorithm, what Fit and Transform mean for each particular algorithm
(IDF, NaiveBayes).

Thanks
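For illustration, a hedged spark.ml sketch of the "fit on training data only, transform everything else" workflow discussed in this thread; the toy data, column names and tiny splits are placeholders, and re-training later simply means repeating the fit calls on the new dataset:

import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("idf-nb-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val train = Seq((0.0, "spark kafka streaming window"), (1.0, "software design software testing"),
                (0.0, "streaming batch window kafka"), (1.0, "software architecture design")).toDF("label", "text")
val test  = Seq((0.0, "kafka streaming"), (1.0, "software design")).toDF("label", "text")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("tokens")
val tf = new HashingTF().setInputCol("tokens").setOutputCol("rawFeatures")

// Fit the IDF model on the training split only...
val idfModel = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  .fit(tf.transform(tokenizer.transform(train)))

// ...then Transform both splits with the same fitted model
val trainFeatures = idfModel.transform(tf.transform(tokenizer.transform(train)))
val testFeatures  = idfModel.transform(tf.transform(tokenizer.transform(test)))

// Fit the classifier on training features and evaluate on the held-out split
val nbModel = new NaiveBayes().setFeaturesCol("features").setLabelCol("label").fit(trainFeatures)
val f1 = new MulticlassClassificationEvaluator().setMetricName("f1")
  .evaluate(nbModel.transform(testFeatures))
println(s"held-out F1: $f1")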

On Tue, Nov 1, 2016 at 5:45 AM, ayan guha  wrote:

> I have come across similar situation recently and decided to run Training
> workflow less frequently than scoring workflow.
>
> In your use case I would imagine you will run IDF fit workflow once in say
> a week. It will produce a model object which will be saved. In scoring
> workflow, you will typically see new unseen dataset and the model generated
> in training flow will be used to score or label this new dataset.
>
> Note, train and test datasets are used during development phase when you
> are trying to find out which model to use and efficientcy/performance/accuracy
> etc. It will never be part of workflow. In a little elaborate setting you
> may want to automate model evaluations, but that's a different story.
>
> Not sure if I could explain properly, please feel free to comment.
> On 1 Nov 2016 22:54, "Nirav Patel"  wrote:
>
>> Yes, I do apply NaiveBayes after IDF .
>>
>> " you can re-train (fit) on all your data before applying it to unseen
>> data." Did you mean I can reuse that model to Transform both training and
>> test data?
>>
>> Here's the process:
>>
>> Datasets:
>>
>>1. Full sample data (labeled)
>>2. Training (labeled)
>>3. Test (labeled)
>>4. Unseen (non-labeled)
>>
>> Here are two workflow options I see:
>>
>> Option - 1 (currently using)
>>
>>1. Fit IDF model (idf-1) on full Sample data
>>2. Apply(Transform) idf-1 on full sample data
>>3. Split data set into Training and Test data
>>4. Fit ML model on Training data
>>5. Apply(Transform) model on Test data
>>6. Apply(Transform) idf-1 on Unseen data
>>7. Apply(Transform) model on Unseen data
>>
>> Option - 2
>>
>>1. Split sample data into Training and Test data
>>2. Fit IDF model (idf-1) only on training data
>>3. Apply(Transform) idf-1 on training data
>>4. Apply(Transform) idf-1 on test data
>>5. Fit ML model on Training data
>>6. Apply(Transform) model on Test data
>>7. Apply(Transform) idf-1 on Unseen data
>>8. Apply(Transform) model on Unseen data
>>
>> So you are suggesting Option-2 in this particular case, right?
>>
>> On Tue, Nov 1, 2016 at 4:24 AM, Robin East 
>> wrote:
>>
>>> Fit it on training data to evaluate the model. You can either use that
>>> model to apply to unseen data or you can re-train (fit) on all your data
>>> before applying it to unseen data.
>>>
>>> fit and transform are 2 different things: fit creates a model, transform
>>> applies a model to data to create transformed output. If you are using your
>>> training data in a subsequent step (e.g. running logistic regression or
>>> some other machine learning algorithm) then you need to transform your
>>> training data using the IDF model before passing it through the next step.
>>>
>>> 
>>> ---
>>> Robin East
>>> *Spark GraphX in Action* Michael Malak and Robin East
>>> Manning Publications Co.
>>> http://www.manning.com/books/spark-graphx-in-action
>>>
>>>
>>>
>>>
>>>
>>> On 1 Nov 2016, at 11:18, Nirav Patel  wrote:
>>>
>>> Just to re-iterate what you said, I should fit IDF model only on
>>> training data and then re-use it for both test data and then later on
>>> unseen data to make predictions.
>>>
>>> On Tue, Nov 1, 2016 at 3:49 AM, Robin East 
>>> wrote:
>>>
 The point of setting aside a portion of your data as a test set is to
 try and mimic applying your model to unseen data. If you fit your IDF model
 to all your data, any evaluation you perform on your test set is likely to
 over perform compared to ‘real’ unseen data. Effectively you would have
 overfit your model.
 
 ---
 Robin East
 *Spark GraphX in Action* Michael Malak and Robin East
 Manning Publications Co.
 http://www.manning.com/books/spark-graphx-in-action





 On 1 Nov 2016, at 10:15, Nirav Patel  wrote:

 FYI, I do reuse IDF model while making prediction against new 

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread Shixiong(Ryan) Zhu
Yes, try 2.0.1!

On Tue, Nov 1, 2016 at 11:25 AM, kant kodali  wrote:

> AH!!! Got it! Should I use 2.0.1 then ? I don't see 2.1.0
>
> On Tue, Nov 1, 2016 at 10:14 AM, Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Dstream "Window" uses "union" to combine multiple RDDs in one window into
>> a single RDD.
>>
>> On Tue, Nov 1, 2016 at 2:59 AM kant kodali  wrote:
>>
>>> @Sean It looks like this problem can happen with other RDD's as well.
>>> Not just unionRDD
>>>
>>> On Tue, Nov 1, 2016 at 2:52 AM, kant kodali  wrote:
>>>
>>> Hi Sean,
>>>
>>> The comments seem very relevant although I am not sure if this pull
>>> request https://github.com/apache/spark/pull/14985 would fix my issue?
>>> I am not sure what unionRDD.scala has anything to do with my error (I don't
>>> know much about spark code base). Do I ever use unionRDD.scala when I call
>>> mapToPair or ReduceByKey or forEachRDD?  This error is very easy to
>>> reproduce you actually don't need to ingest any data to spark streaming
>>> job. Just have one simple transformation consists of mapToPair, reduceByKey
>>> and forEachRDD and have the window interval of 1min and batch interval of
>>> one one second and simple call ssc.awaitTermination() and watch the
>>> Thread Count go up significantly.
>>>
>>> I do think that using a fixed size executor service would probably be a
>>> safer approach. One could leverage ForJoinPool if they think they could
>>> benefit a lot from the work-steal algorithm and doubly ended queues in the
>>> ForkJoinPool.
>>>
>>> Thanks!
>>>
>>>
>>>
>>>
>>> On Tue, Nov 1, 2016 at 2:19 AM, Sean Owen  wrote:
>>>
>>> Possibly https://issues.apache.org/jira/browse/SPARK-17396 ?
>>>
>>> On Tue, Nov 1, 2016 at 2:11 AM kant kodali  wrote:
>>>
>>> Hi Ryan,
>>>
>>> I think you are right. This may not be related to the Receiver. I have
>>> attached jstack dump here. I do a simple MapToPair and reduceByKey and  I
>>> have a window Interval of 1 minute (6ms) and batch interval of 1s (
>>> 1000) This is generating lot of threads atleast 5 to 8 threads per
>>> second and the total number of threads is monotonically increasing. So just
>>> for tweaking purpose I changed my window interval to 1min (6ms) and
>>> batch interval of 10s (1) this looked lot better but still not
>>> ideal at very least it is not monotonic anymore (It goes up and down). Now
>>> my question  really is how do I tune such that my number of threads are
>>> optimal while satisfying the window Interval of 1 minute (6ms) and
>>> batch interval of 1s (1000) ?
>>>
>>> This jstack dump is taken after running my spark driver program for 2
>>> mins and there are about 1000 threads.
>>>
>>> Thanks!
>>>
>>>
>>>
>>>
>


Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread kant kodali
AH!!! Got it! Should I use 2.0.1 then ? I don't see 2.1.0

On Tue, Nov 1, 2016 at 10:14 AM, Shixiong(Ryan) Zhu  wrote:

> Dstream "Window" uses "union" to combine multiple RDDs in one window into
> a single RDD.
>
> On Tue, Nov 1, 2016 at 2:59 AM kant kodali  wrote:
>
>> @Sean It looks like this problem can happen with other RDD's as well. Not
>> just unionRDD
>>
>> On Tue, Nov 1, 2016 at 2:52 AM, kant kodali  wrote:
>>
>> Hi Sean,
>>
>> The comments seem very relevant although I am not sure if this pull
>> request https://github.com/apache/spark/pull/14985 would fix my issue? I
>> am not sure what unionRDD.scala has anything to do with my error (I don't
>> know much about spark code base). Do I ever use unionRDD.scala when I call
>> mapToPair or ReduceByKey or forEachRDD?  This error is very easy to
>> reproduce you actually don't need to ingest any data to spark streaming
>> job. Just have one simple transformation consists of mapToPair, reduceByKey
>> and forEachRDD and have the window interval of 1min and batch interval of
>> one one second and simple call ssc.awaitTermination() and watch the
>> Thread Count go up significantly.
>>
>> I do think that using a fixed size executor service would probably be a
>> safer approach. One could leverage ForJoinPool if they think they could
>> benefit a lot from the work-steal algorithm and doubly ended queues in the
>> ForkJoinPool.
>>
>> Thanks!
>>
>>
>>
>>
>> On Tue, Nov 1, 2016 at 2:19 AM, Sean Owen  wrote:
>>
>> Possibly https://issues.apache.org/jira/browse/SPARK-17396 ?
>>
>> On Tue, Nov 1, 2016 at 2:11 AM kant kodali  wrote:
>>
>> Hi Ryan,
>>
>> I think you are right. This may not be related to the Receiver. I have
>> attached jstack dump here. I do a simple MapToPair and reduceByKey and  I
>> have a window Interval of 1 minute (6ms) and batch interval of 1s (
>> 1000) This is generating lot of threads atleast 5 to 8 threads per
>> second and the total number of threads is monotonically increasing. So just
>> for tweaking purpose I changed my window interval to 1min (6ms) and
>> batch interval of 10s (1) this looked lot better but still not ideal
>> at very least it is not monotonic anymore (It goes up and down). Now my
>> question  really is how do I tune such that my number of threads are
>> optimal while satisfying the window Interval of 1 minute (6ms) and
>> batch interval of 1s (1000) ?
>>
>> This jstack dump is taken after running my spark driver program for 2
>> mins and there are about 1000 threads.
>>
>> Thanks!
>>
>>
>>
>>


Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread Shixiong(Ryan) Zhu
Dstream "Window" uses "union" to combine multiple RDDs in one window into a
single RDD.
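For context, a minimal sketch of the kind of job being described (placeholder source, host and port; 1-second batches with a 1-minute window, so each windowed RDD is a union of roughly 60 underlying RDDs):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[4]").setAppName("window-sketch")
val ssc = new StreamingContext(conf, Seconds(1))          // batch interval: 1 second

val lines = ssc.socketTextStream("localhost", 9999)       // placeholder source
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Minutes(1), Seconds(1))  // window: 1 min, slide: 1 s

counts.foreachRDD(rdd => println(s"distinct keys in window: ${rdd.count()}"))

ssc.start()
ssc.awaitTermination()
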
On Tue, Nov 1, 2016 at 2:59 AM kant kodali  wrote:

> @Sean It looks like this problem can happen with other RDD's as well. Not
> just unionRDD
>
> On Tue, Nov 1, 2016 at 2:52 AM, kant kodali  wrote:
>
> Hi Sean,
>
> The comments seem very relevant although I am not sure if this pull
> request https://github.com/apache/spark/pull/14985 would fix my issue? I
> am not sure what unionRDD.scala has anything to do with my error (I don't
> know much about spark code base). Do I ever use unionRDD.scala when I call
> mapToPair or ReduceByKey or forEachRDD?  This error is very easy to
> reproduce you actually don't need to ingest any data to spark streaming
> job. Just have one simple transformation consists of mapToPair, reduceByKey
> and forEachRDD and have the window interval of 1min and batch interval of
> one one second and simple call ssc.awaitTermination() and watch the
> Thread Count go up significantly.
>
> I do think that using a fixed size executor service would probably be a
> safer approach. One could leverage ForJoinPool if they think they could
> benefit a lot from the work-steal algorithm and doubly ended queues in the
> ForkJoinPool.
>
> Thanks!
>
>
>
>
> On Tue, Nov 1, 2016 at 2:19 AM, Sean Owen  wrote:
>
> Possibly https://issues.apache.org/jira/browse/SPARK-17396 ?
>
> On Tue, Nov 1, 2016 at 2:11 AM kant kodali  wrote:
>
> Hi Ryan,
>
> I think you are right. This may not be related to the Receiver. I have
> attached jstack dump here. I do a simple MapToPair and reduceByKey and  I
> have a window Interval of 1 minute (6ms) and batch interval of 1s (
> 1000) This is generating lot of threads atleast 5 to 8 threads per second
> and the total number of threads is monotonically increasing. So just for
> tweaking purpose I changed my window interval to 1min (6ms) and batch
> interval of 10s (1) this looked lot better but still not ideal at
> very least it is not monotonic anymore (It goes up and down). Now my
> question  really is how do I tune such that my number of threads are
> optimal while satisfying the window Interval of 1 minute (6ms) and
> batch interval of 1s (1000) ?
>
> This jstack dump is taken after running my spark driver program for 2 mins
> and there are about 1000 threads.
>
> Thanks!
>
>
>
>


Re: Deep learning libraries for scala

2016-11-01 Thread Benjamin Kim
To add, I see that Databricks has been busy integrating deep learning more into 
their product and put out a new article about this.

https://databricks.com/blog/2016/10/27/gpu-acceleration-in-databricks.html 


An interesting tidbit is at the bottom of the article mentioning TensorFrames.

https://github.com/databricks/tensorframes 


Seems like an interesting direction…

Cheers,
Ben


> On Oct 19, 2016, at 9:05 AM, janardhan shetty  wrote:
> 
> Agreed. But as it states deeper integration with (scala) is yet to be 
> developed. 
> Any thoughts on how to use tensorflow with scala ? Need to write wrappers I 
> think.
> 
> 
> On Oct 19, 2016 7:56 AM, "Benjamin Kim"  > wrote:
> On that note, here is an article that Databricks made regarding using 
> Tensorflow in conjunction with Spark.
> 
> https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
>  
> 
> 
> Cheers,
> Ben
> 
> 
>> On Oct 19, 2016, at 3:09 AM, Gourav Sengupta > > wrote:
>> 
>> while using Deep Learning you might want to stay as close to tensorflow as 
>> possible. There is very less translation loss, you get to access stable, 
>> scalable and tested libraries from the best brains in the industry and as 
>> far as Scala goes, it helps a lot to think about using the language as a 
>> tool to access algorithms in this instance unless you want to start 
>> developing algorithms from grounds up ( and in which case you might not 
>> require any libraries at all).
>> 
>> On Sat, Oct 1, 2016 at 3:30 AM, janardhan shetty > > wrote:
>> Hi,
>> 
>> Are there any good libraries which can be used for scala deep learning 
>> models ?
>> How can we integrate tensorflow with scala ML ?
>> 
> 



Re: GraphFrame BFS

2016-11-01 Thread Denny Lee
You should be able to use GraphX or a GraphFrames subgraph to build up your
subgraph.  A good example for GraphFrames can be found at:
http://graphframes.github.io/user-guide.html#subgraphs.  HTH!
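A small sketch of that approach against the example data in the question (it assumes the graphframes package is on the classpath, e.g. added via --packages; the BFS start/end expressions are illustrative):

import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder().appName("subgraph-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Toy version of the data from the question: edges carry a relationship type
val vertices = Seq("1", "2", "3").toDF("id")
val edges = Seq(("1", "2", "p2p"), ("2", "3", "p2p"), ("2", "3", "c2p"))
  .toDF("src", "dst", "relationship")
val g = GraphFrame(vertices, edges)

// Keep only p2p edges, then build a new GraphFrame over the filtered edge set
val p2pOnly = GraphFrame(g.vertices, g.edges.filter("relationship = 'p2p'"))

// BFS on the restricted graph follows p2p edges as far as they go from vertex 1
val paths = p2pOnly.bfs.fromExpr("id = '1'").toExpr("id = '3'").run()
paths.show(false)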

On Mon, Oct 10, 2016 at 9:32 PM cashinpj  wrote:

> Hello,
>
> I have a set of data representing various network connections.  Each vertex
> is represented by a single id, while the edges have  a source id,
> destination id, and a relationship (peer to peer, customer to provider, or
> provider to customer).  I am trying to create a sub graph build around a
> single source node following one type of edge as far as possible.
>
> For example:
> 1 2 p2p
> 2 3 p2p
> 2 3 c2p
>
> Following the p2p edges would give:
>
> 1 2 p2p
> 2 3 p2p
>
> I am pretty new to GraphX and GraphFrames, but was wondering if it is
> possible to get this behavior using the GraphFrames bfs() function or would
> it be better to modify the already existing Pregel implementation of bfs?
>
> Thank you for your time.
>
> Padraic
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/GraphFrame-BFS-tp27876.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Application remains in WAITING state after Master election

2016-11-01 Thread Alexis Seigneurin
Hi,

I am running a Spark standalone cluster with 2 masters: one active, the
other in standby. An application is running on this cluster.

When the active master dies, the standby master becomes active and the
running application reconnects to the newly active master.

The only problem I see is the UI displays the application as being in
WAITING state. It runs fine, though.

[image: Inline image 1]

Any idea why the application state is not switching to RUNNING?

Thanks,
Alexis


Re: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Mich Talebzadeh
If you are using local mode then there is only one JVM. On Linux, mine looks
like this:

${SPARK_HOME}/bin/spark-submit \
--packages ${PACKAGES} \
--driver-memory 8G \
--num-executors 1 \
--executor-memory 8G \

--master local[12] \
--conf "${SCHEDULER}" \
--conf "${EXTRAJAVAOPTIONS}" \
--jars ${JARS} \
--class "${FILE_NAME}" \
--conf "${SPARKUIPORT}" \
--conf "${SPARKDRIVERPORT}" \
--conf "${SPARKFILESERVERPORT}" \
--conf "${SPARKBLOCKMANAGERPORT}" \
--conf "${SPARKKRYOSERIALIZERBUFFERMAX}" \
${JAR_FILE}

These parameters are defined below

function default_settings {
export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export SCHEDULER="spark.scheduler.mode=FAIR"
export EXTRAJAVAOPTIONS="spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
export JARS="/home/hduser/jars/spark-streaming-kafka-assembly_2.11-1.6.1.jar"
export SPARKUIPORT="spark.ui.port=5"
export SPARKDRIVERPORT="spark.driver.port=54631"
export SPARKFILESERVERPORT="spark.fileserver.port=54731"
export SPARKBLOCKMANAGERPORT="spark.blockManager.port=54832"
export SPARKKRYOSERIALIZERBUFFERMAX="spark.kryoserializer.buffer.max=512"
}

and other jar files are passed through --jars. Note that ${JAR_FILE} in my
case is built through Maven or SBT.

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 November 2016 at 14:02, Jan Botorek  wrote:

> Yes, exactly.
> My (testing) run script is:
>
> spark-submit --class com.infor.skyvault.tests.LinearRegressionTest
> --master local C:\_resources\spark-1.0-SNAPSHOT.jar
> -DtrainDataPath="/path/to/model/data"
>
>
>
>
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* Tuesday, November 1, 2016 2:51 PM
> *To:* Jan Botorek 
> *Cc:* Vinod Mangipudi ; user 
>
> *Subject:* Re: Add jar files on classpath when submitting tasks to Spark
>
>
>
> Are you submitting your job through spark-submit?
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
> On 1 November 2016 at 13:39, Jan Botorek  wrote:
>
> Hello,
>
> This approach unfortunately doesn’t work for job submission for me. It
> works in the shell, but not when submitted.
>
> I ensured the (only worker) node has desired directory.
>
>
>
> Neither specifying all jars as you suggested, neither using
> */path/to/jarfiles/** works.
>
>
>
> Could you verify, that using this settings you are able to submit jobs
> with according dependencies, please?
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* Tuesday, November 1, 2016 2:18 PM
> *To:* Vinod Mangipudi 
>
>
> *Cc:* user 
> *Subject:* Re: Add jar files on classpath when submitting tasks to Spark
>
>
>
> you can do that as long as every node has the directory referenced.
>
>
>
> For example
>
>
>
> spark.driver.extraClassPath  /home/hduser/jars/ojdbc6.jar:/
> home/hduser/jars/jconn4.jar
> spark.executor.extraClassPath/home/hduser/jars/ojdbc6.jar:/
> home/hduser/jars/jconn4.jar
>
>
>
> this will work as long as all nodes have that directory.
>
>
>
> The other alternative is to mount the shared directory as NFS mount across
> all the nodes and all the noses can read from that shared directory
>
>
>
> HTH
>
>
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of 

RE: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Jan Botorek
Yes, exactly.
My (testing) run script is:
spark-submit --class com.infor.skyvault.tests.LinearRegressionTest --master 
local C:\_resources\spark-1.0-SNAPSHOT.jar -DtrainDataPath="/path/to/model/data"



From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Tuesday, November 1, 2016 2:51 PM
To: Jan Botorek 
Cc: Vinod Mangipudi ; user 
Subject: Re: Add jar files on classpath when submitting tasks to Spark

Are you submitting your job through spark-submit?


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 1 November 2016 at 13:39, Jan Botorek 
> wrote:
Hello,
This approach unfortunately doesn’t work for job submission for me. It works in 
the shell, but not when submitted.
I ensured the (only worker) node has desired directory.

Neither specifying all jars as you suggested, neither using /path/to/jarfiles/* 
works.

Could you verify, that using this settings you are able to submit jobs with 
according dependencies, please?

From: Mich Talebzadeh 
[mailto:mich.talebza...@gmail.com]
Sent: Tuesday, November 1, 2016 2:18 PM
To: Vinod Mangipudi >

Cc: user >
Subject: Re: Add jar files on classpath when submitting tasks to Spark

you can do that as long as every node has the directory referenced.

For example

spark.driver.extraClassPath  
/home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar
spark.executor.extraClassPath
/home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar

this will work as long as all nodes have that directory.

The other alternative is to mount the shared directory as an NFS mount across all 
the nodes so that all the nodes can read from that shared directory.

HTH






Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 1 November 2016 at 13:04, Vinod Mangipudi 
> wrote:
unsubscribe

On Tue, Nov 1, 2016 at 8:56 AM, Jan Botorek 
> wrote:
Thank you for the reply.
I am aware of the parameters used when submitting the tasks (--jars is working 
for us).

But, isn’t there any way how to specify a location (directory) for jars „in 
global“ - in the spark-defaults.conf??


From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Tuesday, November 1, 2016 1:49 PM
To: Jan Botorek >
Cc: user >
Subject: Re: Add jar files on classpath when submitting tasks to Spark


There are options to specify external jars in the form of --jars, 
--driver-classpath etc depending on spark version and cluster manager.. Please 
see spark documents for configuration sections and/or run spark submit help to 
see available options.
On 1 Nov 2016 23:13, "Jan Botorek" 
> wrote:
Hello,
I have a problem trying to add jar files to be available on classpath when 
submitting task to Spark.

In my spark-defaults.conf file I have configuration:
spark.driver.extraClassPath = path/to/folder/with/jars
all jars in the folder are available in SPARK-SHELL

The problem is that jars are not on the classpath for SPARK-MASTER; more 
precisely – when I submit any job that utilizes any jar from external folder, 
the java.lang.ClassNotFoundException is thrown.
Moving all external jars into the jars folder solves the situation, but we need 
to keep external files separatedly.

Thank you for any help
Best regards,
Jan





Re: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Mich Talebzadeh
Are you submitting your job through spark-submit?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 November 2016 at 13:39, Jan Botorek  wrote:

> Hello,
>
> This approach unfortunately doesn’t work for job submission for me. It
> works in the shell, but not when submitted.
>
> I ensured the (only worker) node has desired directory.
>
>
>
> Neither specifying all jars as you suggested, neither using
> */path/to/jarfiles/** works.
>
>
>
> Could you verify, that using this settings you are able to submit jobs
> with according dependencies, please?
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* Tuesday, November 1, 2016 2:18 PM
> *To:* Vinod Mangipudi 
>
> *Cc:* user 
> *Subject:* Re: Add jar files on classpath when submitting tasks to Spark
>
>
>
> you can do that as long as every node has the directory referenced.
>
>
>
> For example
>
>
>
> spark.driver.extraClassPath  /home/hduser/jars/ojdbc6.jar:/
> home/hduser/jars/jconn4.jar
> spark.executor.extraClassPath/home/hduser/jars/ojdbc6.jar:/
> home/hduser/jars/jconn4.jar
>
>
>
> this will work as long as all nodes have that directory.
>
>
>
> The other alternative is to mount the shared directory as an NFS mount across
> all the nodes so that all the nodes can read from that shared directory.
>
>
>
> HTH
>
>
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
> On 1 November 2016 at 13:04, Vinod Mangipudi  wrote:
>
> unsubscribe
>
>
>
> On Tue, Nov 1, 2016 at 8:56 AM, Jan Botorek  wrote:
>
> Thank you for the reply.
>
> I am aware of the parameters used when submitting the tasks (--jars is
> working for us).
>
>
>
> But, isn’t there any way how to specify a location (directory) for jars
> „in global“ - in the spark-defaults.conf??
>
>
>
>
>
> *From:* ayan guha [mailto:guha.a...@gmail.com]
> *Sent:* Tuesday, November 1, 2016 1:49 PM
> *To:* Jan Botorek 
> *Cc:* user 
> *Subject:* Re: Add jar files on classpath when submitting tasks to Spark
>
>
>
> There are options to specify external jars in the form of --jars,
> --driver-classpath etc depending on spark version and cluster manager..
> Please see spark documents for configuration sections and/or run spark
> submit help to see available options.
>
> On 1 Nov 2016 23:13, "Jan Botorek"  wrote:
>
> Hello,
>
> I have a problem trying to add jar files to be available on classpath when
> submitting task to Spark.
>
>
>
> In my spark-defaults.conf file I have configuration:
>
> *spark.driver.extraClassPath = path/to/folder/with/jars*
>
> all jars in the folder are available in SPARK-SHELL
>
>
>
> The problem is that jars are not on the classpath for SPARK-MASTER; more
> precisely – when I submit any job that utilizes any jar from external
> folder, the* java.lang.ClassNotFoundException* is thrown.
>
> Moving all external jars into the *jars* folder solves the situation, but
> we need to keep external files separatedly.
>
>
>
> Thank you for any help
>
> Best regards,
>
> Jan
>
>
>
>
>


RE: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Jan Botorek
Hello,
This approach unfortunately doesn’t work for job submission for me. It works in 
the shell, but not when submitted.
I ensured the (only worker) node has the desired directory.

Neither specifying all the jars as you suggested nor using /path/to/jarfiles/*
works.

Could you verify that, using these settings, you are able to submit jobs with
the required dependencies, please?

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Tuesday, November 1, 2016 2:18 PM
To: Vinod Mangipudi 
Cc: user 
Subject: Re: Add jar files on classpath when submitting tasks to Spark

you can do that as long as every node has the directory referenced.

For example

spark.driver.extraClassPath  
/home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar
spark.executor.extraClassPath
/home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar

this will work as long as all nodes have that directory.

The other alternative is to mount the shared directory as an NFS mount across all 
the nodes so that all the nodes can read from that shared directory.

HTH






Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 1 November 2016 at 13:04, Vinod Mangipudi 
> wrote:
unsubscribe

On Tue, Nov 1, 2016 at 8:56 AM, Jan Botorek 
> wrote:
Thank you for the reply.
I am aware of the parameters used when submitting the tasks (--jars is working 
for us).

But, isn’t there any way how to specify a location (directory) for jars „in 
global“ - in the spark-defaults.conf??


From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Tuesday, November 1, 2016 1:49 PM
To: Jan Botorek >
Cc: user >
Subject: Re: Add jar files on classpath when submitting tasks to Spark


There are options to specify external jars in the form of --jars, 
--driver-classpath etc depending on spark version and cluster manager.. Please 
see spark documents for configuration sections and/or run spark submit help to 
see available options.
On 1 Nov 2016 23:13, "Jan Botorek" 
> wrote:
Hello,
I have a problem trying to add jar files to be available on classpath when 
submitting task to Spark.

In my spark-defaults.conf file I have configuration:
spark.driver.extraClassPath = path/to/folder/with/jars
all jars in the folder are available in SPARK-SHELL

The problem is that jars are not on the classpath for SPARK-MASTER; more 
precisely – when I submit any job that utilizes any jar from external folder, 
the java.lang.ClassNotFoundException is thrown.
Moving all external jars into the jars folder solves the situation, but we need 
to keep external files separatedly.

Thank you for any help
Best regards,
Jan




Re: Spark ML - CrossValidation - How to get Evaluation metrics of best model

2016-11-01 Thread Sean Owen
CrossValidator splits the data into k sets, and then trains k times,
holding out one subset for cross-validation each time. You are correct that
you should actually withhold an additional test set, before you use
CrossValidator, in order to get an unbiased estimate of the best model's
performance.
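
As a minimal sketch of that split-then-cross-validate flow (logistic regression
is only a stand-in estimator, and `data` is a hypothetical DataFrame with
"label" and "features" columns):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

val lr = new LogisticRegression()
val evaluator = new MulticlassClassificationEvaluator().setMetricName("f1")
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)

// Cross-validation picks the best parameters using folds of `train` only...
val cvModel = cv.fit(train)
// ...and the held-out `test` split gives the unbiased F1 estimate of the best model.
val testF1 = evaluator.evaluate(cvModel.transform(test))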

On Tue, Nov 1, 2016 at 12:10 PM Nirav Patel  wrote:

> I am running classification model. with normal training-test split I can
> check model accuracy and F1 score using MulticlassClassificationEvaluator.
> How can I do this with CrossValidation approach?
> Afaik, you Fit entire sample data in CrossValidator as you don't want to
> leave out any observation from either testing or training. But by doing so
> I don't have anymore unseen data on which I can run finalized model on. So
> is there a way I can get Accuracy and F1 score of a best model resulted
> from cross validation?
> Or should I still split sample data in to training and test before running
> cross validation against only training data? so later I can test it against
> test data.
>
>
>


Re: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Mich Talebzadeh
you can do that as long as every node has the directory referenced.

For example

spark.driver.extraClassPath
/home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar
spark.executor.extraClassPath
/home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar

this will work as long as all nodes have that directory.

The other alternative is to mount the shared directory as an NFS mount across
all the nodes so that all the nodes can read from that shared directory.
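
To make both approaches concrete, here is a sketch (the directory and jar names
are illustrative; the wildcard form relies on the JVM's classpath wildcard
handling, and the directory must exist at the same path on every node):

spark-defaults.conf:

spark.driver.extraClassPath    /opt/external-libs/*
spark.executor.extraClassPath  /opt/external-libs/*

or, per submission, list the jars explicitly so spark-submit also ships them to
the executors:

spark-submit --class com.example.MyJob --master spark://spark-master:7077 \
  --jars /opt/external-libs/dep1.jar,/opt/external-libs/dep2.jar \
  my-app.jar

(com.example.MyJob and my-app.jar are placeholders for the real application.)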

HTH





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 November 2016 at 13:04, Vinod Mangipudi  wrote:

> unsubscribe
>
> On Tue, Nov 1, 2016 at 8:56 AM, Jan Botorek  wrote:
>
>> Thank you for the reply.
>>
>> I am aware of the parameters used when submitting the tasks (--jars is
>> working for us).
>>
>>
>>
>> But, isn’t there any way how to specify a location (directory) for jars
>> „in global“ - in the spark-defaults.conf??
>>
>>
>>
>>
>>
>> *From:* ayan guha [mailto:guha.a...@gmail.com]
>> *Sent:* Tuesday, November 1, 2016 1:49 PM
>> *To:* Jan Botorek 
>> *Cc:* user 
>> *Subject:* Re: Add jar files on classpath when submitting tasks to Spark
>>
>>
>>
>> There are options to specify external jars in the form of --jars,
>> --driver-classpath etc depending on spark version and cluster manager..
>> Please see spark documents for configuration sections and/or run spark
>> submit help to see available options.
>>
>> On 1 Nov 2016 23:13, "Jan Botorek"  wrote:
>>
>> Hello,
>>
>> I have a problem trying to add jar files to be available on classpath
>> when submitting task to Spark.
>>
>>
>>
>> In my spark-defaults.conf file I have configuration:
>>
>> *spark.driver.extraClassPath = path/to/folder/with/jars*
>>
>> all jars in the folder are available in SPARK-SHELL
>>
>>
>>
>> The problem is that jars are not on the classpath for SPARK-MASTER; more
>> precisely – when I submit any job that utilizes any jar from external
>> folder, the* java.lang.ClassNotFoundException* is thrown.
>>
>> Moving all external jars into the *jars* folder solves the situation,
>> but we need to keep external files separatedly.
>>
>>
>>
>> Thank you for any help
>>
>> Best regards,
>>
>> Jan
>>
>>
>


Re: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Vinod Mangipudi
unsubscribe

On Tue, Nov 1, 2016 at 8:56 AM, Jan Botorek  wrote:

> Thank you for the reply.
>
> I am aware of the parameters used when submitting the tasks (--jars is
> working for us).
>
>
>
> But, isn’t there any way how to specify a location (directory) for jars
> „in global“ - in the spark-defaults.conf??
>
>
>
>
>
> *From:* ayan guha [mailto:guha.a...@gmail.com]
> *Sent:* Tuesday, November 1, 2016 1:49 PM
> *To:* Jan Botorek 
> *Cc:* user 
> *Subject:* Re: Add jar files on classpath when submitting tasks to Spark
>
>
>
> There are options to specify external jars in the form of --jars,
> --driver-classpath etc depending on spark version and cluster manager..
> Please see spark documents for configuration sections and/or run spark
> submit help to see available options.
>
> On 1 Nov 2016 23:13, "Jan Botorek"  wrote:
>
> Hello,
>
> I have a problem trying to add jar files to be available on classpath when
> submitting task to Spark.
>
>
>
> In my spark-defaults.conf file I have configuration:
>
> *spark.driver.extraClassPath = path/to/folder/with/jars*
>
> all jars in the folder are available in SPARK-SHELL
>
>
>
> The problem is that jars are not on the classpath for SPARK-MASTER; more
> precisely – when I submit any job that utilizes any jar from external
> folder, the* java.lang.ClassNotFoundException* is thrown.
>
> Moving all external jars into the *jars* folder solves the situation, but
> we need to keep external files separatedly.
>
>
>
> Thank you for any help
>
> Best regards,
>
> Jan
>
>


RE: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Jan Botorek
Thank you for the reply.
I am aware of the parameters used when submitting the tasks (--jars is working 
for us).

But isn’t there any way to specify a location (directory) for jars globally,
i.e. in spark-defaults.conf?


From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Tuesday, November 1, 2016 1:49 PM
To: Jan Botorek 
Cc: user 
Subject: Re: Add jar files on classpath when submitting tasks to Spark


There are options to specify external jars in the form of --jars, 
--driver-classpath etc depending on spark version and cluster manager.. Please 
see spark documents for configuration sections and/or run spark submit help to 
see available options.
On 1 Nov 2016 23:13, "Jan Botorek" 
> wrote:
Hello,
I have a problem trying to add jar files to be available on classpath when 
submitting task to Spark.

In my spark-defaults.conf file I have configuration:
spark.driver.extraClassPath = path/to/folder/with/jars
all jars in the folder are available in SPARK-SHELL

The problem is that jars are not on the classpath for SPARK-MASTER; more 
precisely – when I submit any job that utilizes any jar from external folder, 
the java.lang.ClassNotFoundException is thrown.
Moving all external jars into the jars folder solves the situation, but we need 
to keep external files separatedly.

Thank you for any help
Best regards,
Jan


Re: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread ayan guha
There are options to specify external jars in the form of --jars,
--driver-class-path etc., depending on the Spark version and cluster manager.
Please see the Spark documentation's configuration section and/or run
spark-submit --help to see the available options.
On 1 Nov 2016 23:13, "Jan Botorek"  wrote:

> Hello,
>
> I have a problem trying to add jar files to be available on classpath when
> submitting task to Spark.
>
>
>
> In my spark-defaults.conf file I have configuration:
>
> *spark.driver.extraClassPath = path/to/folder/with/jars*
>
> all jars in the folder are available in SPARK-SHELL
>
>
>
> The problem is that jars are not on the classpath for SPARK-MASTER; more
> precisely – when I submit any job that utilizes any jar from external
> folder, the* java.lang.ClassNotFoundException* is thrown.
>
> Moving all external jars into the *jars* folder solves the situation, but
> we need to keep external files separatedly.
>
>
>
> Thank you for any help
>
> Best regards,
>
> Jan
>


Re: Spark ML - Is IDF model reusable

2016-11-01 Thread ayan guha
I have come across a similar situation recently and decided to run the training
workflow less frequently than the scoring workflow.

In your use case I would imagine you will run the IDF fit workflow once a week,
say. It will produce a model object which will be saved. In the scoring
workflow, you will typically see a new, unseen dataset, and the model generated
in the training flow will be used to score or label this new dataset.

Note that train and test datasets are used during the development phase, when
you are trying to find out which model to use and to measure
efficiency/performance/accuracy etc. They will never be part of the workflow.
In a more elaborate setting you may want to automate model evaluation, but
that's a different story.

Not sure if I explained that properly; please feel free to comment.
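
As a rough sketch of that split between a training workflow and a scoring
workflow (column names, paths and the DataFrames trainTF / newBatchTF are all
hypothetical; IDF is just the example model from this thread):

import org.apache.spark.ml.feature.{IDF, IDFModel}

// Training workflow (run infrequently): fit on the training data and persist the model.
val idfModel = new IDF()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .fit(trainTF)
idfModel.write.overwrite().save("/models/idf-latest")

// Scoring workflow (run frequently): reload the persisted model and label new, unseen data.
val reloaded = IDFModel.load("/models/idf-latest")
val scored   = reloaded.transform(newBatchTF)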
On 1 Nov 2016 22:54, "Nirav Patel"  wrote:

> Yes, I do apply NaiveBayes after IDF .
>
> " you can re-train (fit) on all your data before applying it to unseen
> data." Did you mean I can reuse that model to Transform both training and
> test data?
>
> Here's the process:
>
> Datasets:
>
>1. Full sample data (labeled)
>2. Training (labeled)
>3. Test (labeled)
>4. Unseen (non-labeled)
>
> Here are two workflow options I see:
>
> Option - 1 (currently using)
>
>1. Fit IDF model (idf-1) on full Sample data
>2. Apply(Transform) idf-1 on full sample data
>3. Split data set into Training and Test data
>4. Fit ML model on Training data
>5. Apply(Transform) model on Test data
>6. Apply(Transform) idf-1 on Unseen data
>7. Apply(Transform) model on Unseen data
>
> Option - 2
>
>1. Split sample data into Training and Test data
>2. Fit IDF model (idf-1) only on training data
>3. Apply(Transform) idf-1 on training data
>4. Apply(Transform) idf-1 on test data
>5. Fit ML model on Training data
>6. Apply(Transform) model on Test data
>7. Apply(Transform) idf-1 on Unseen data
>8. Apply(Transform) model on Unseen data
>
> So you are suggesting Option-2 in this particular case, right?
>
> On Tue, Nov 1, 2016 at 4:24 AM, Robin East  wrote:
>
>> Fit it on training data to evaluate the model. You can either use that
>> model to apply to unseen data or you can re-train (fit) on all your data
>> before applying it to unseen data.
>>
>> fit and transform are 2 different things: fit creates a model, transform
>> applies a model to data to create transformed output. If you are using your
>> training data in a subsequent step (e.g. running logistic regression or
>> some other machine learning algorithm) then you need to transform your
>> training data using the IDF model before passing it through the next step.
>>
>> 
>> ---
>> Robin East
>> *Spark GraphX in Action* Michael Malak and Robin East
>> Manning Publications Co.
>> http://www.manning.com/books/spark-graphx-in-action
>>
>>
>>
>>
>>
>> On 1 Nov 2016, at 11:18, Nirav Patel  wrote:
>>
>> Just to re-iterate what you said, I should fit IDF model only on training
>> data and then re-use it for both test data and then later on unseen data to
>> make predictions.
>>
>> On Tue, Nov 1, 2016 at 3:49 AM, Robin East 
>> wrote:
>>
>>> The point of setting aside a portion of your data as a test set is to
>>> try and mimic applying your model to unseen data. If you fit your IDF model
>>> to all your data, any evaluation you perform on your test set is likely to
>>> over perform compared to ‘real’ unseen data. Effectively you would have
>>> overfit your model.
>>> 
>>> ---
>>> Robin East
>>> *Spark GraphX in Action* Michael Malak and Robin East
>>> Manning Publications Co.
>>> http://www.manning.com/books/spark-graphx-in-action
>>>
>>>
>>>
>>>
>>>
>>> On 1 Nov 2016, at 10:15, Nirav Patel  wrote:
>>>
>>> FYI, I do reuse IDF model while making prediction against new unlabeled
>>> data but not between training and test data while training a model.
>>>
>>> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel 
>>> wrote:
>>>
 I am using IDF estimator/model (TF-IDF) to convert text features into
 vectors. Currently, I fit IDF model on all sample data and then transform
 them. I read somewhere that I should split my data into training and test
 before fitting IDF model; Fit IDF only on training data and then use same
 transformer to transform training and test data.
 This raise more questions:
 1) Why would you do that? What exactly do IDF learn during fitting
 process that it can reuse to transform any new dataset. Perhaps idea is to
 keep same value for |D| and DF|t, D| while use new TF|t, D| ?
 2) If not then fitting and transforming seems redundant for IDF model

>>>
>>>
>>>
>>>

Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Jan Botorek
Hello,
I have a problem trying to make jar files available on the classpath when 
submitting tasks to Spark.

In my spark-defaults.conf file I have the configuration:
spark.driver.extraClassPath = path/to/folder/with/jars
and all jars in the folder are available in SPARK-SHELL.

The problem is that the jars are not on the classpath for SPARK-MASTER; more 
precisely, when I submit any job that utilizes a jar from the external folder, 
a java.lang.ClassNotFoundException is thrown.
Moving all external jars into the jars folder solves the situation, but we need 
to keep the external files separately.

Thank you for any help
Best regards,
Jan


Spark ML - CrossValidation - How to get Evaluation metrics of best model

2016-11-01 Thread Nirav Patel
I am running a classification model. With a normal training-test split I can
check model accuracy and F1 score using MulticlassClassificationEvaluator.
How can I do this with the CrossValidation approach?
Afaik, you fit the entire sample data in CrossValidator as you don't want to
leave out any observation from either testing or training. But by doing so
I don't have any unseen data left on which I can run the finalized model. So
is there a way I can get the accuracy and F1 score of the best model resulting
from cross-validation?
Or should I still split the sample data into training and test before running
cross-validation against only the training data, so that later I can test it
against the test data?




Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
Yes, I do apply NaiveBayes after IDF .

" you can re-train (fit) on all your data before applying it to unseen
data." Did you mean I can reuse that model to Transform both training and
test data?

Here's the process:

Datasets:

   1. Full sample data (labeled)
   2. Training (labeled)
   3. Test (labeled)
   4. Unseen (non-labeled)

Here are two workflow options I see:

Option - 1 (currently using)

   1. Fit IDF model (idf-1) on full Sample data
   2. Apply(Transform) idf-1 on full sample data
   3. Split data set into Training and Test data
   4. Fit ML model on Training data
   5. Apply(Transform) model on Test data
   6. Apply(Transform) idf-1 on Unseen data
   7. Apply(Transform) model on Unseen data

Option - 2

   1. Split sample data into Training and Test data
   2. Fit IDF model (idf-1) only on training data
   3. Apply(Transform) idf-1 on training data
   4. Apply(Transform) idf-1 on test data
   5. Fit ML model on Training data
   6. Apply(Transform) model on Test data
   7. Apply(Transform) idf-1 on Unseen data
   8. Apply(Transform) model on Unseen data

So you are suggesting Option-2 in this particular case, right?
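
For what it is worth, a minimal sketch of the Option-2 flow (assuming a
hypothetical DataFrame `sample` with a tokenized "words" column; the column
names are illustrative):

import org.apache.spark.ml.feature.{HashingTF, IDF}

val Array(training, test) = sample.randomSplit(Array(0.8, 0.2), seed = 1L)

val tf = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val trainTF = tf.transform(training)
val testTF  = tf.transform(test)

// Fit IDF on the training split only -- this is what fixes |D| and DF(t, D)...
val idfModel = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(trainTF)

// ...and the same fitted model then transforms training, test and, later, unseen data.
val trainFeatures = idfModel.transform(trainTF)
val testFeatures  = idfModel.transform(testTF)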

On Tue, Nov 1, 2016 at 4:24 AM, Robin East  wrote:

> Fit it on training data to evaluate the model. You can either use that
> model to apply to unseen data or you can re-train (fit) on all your data
> before applying it to unseen data.
>
> fit and transform are 2 different things: fit creates a model, transform
> applies a model to data to create transformed output. If you are using your
> training data in a subsequent step (e.g. running logistic regression or
> some other machine learning algorithm) then you need to transform your
> training data using the IDF model before passing it through the next step.
>
> 
> ---
> Robin East
> *Spark GraphX in Action* Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
>
>
>
>
>
> On 1 Nov 2016, at 11:18, Nirav Patel  wrote:
>
> Just to re-iterate what you said, I should fit IDF model only on training
> data and then re-use it for both test data and then later on unseen data to
> make predictions.
>
> On Tue, Nov 1, 2016 at 3:49 AM, Robin East  wrote:
>
>> The point of setting aside a portion of your data as a test set is to try
>> and mimic applying your model to unseen data. If you fit your IDF model to
>> all your data, any evaluation you perform on your test set is likely to
>> over perform compared to ‘real’ unseen data. Effectively you would have
>> overfit your model.
>> 
>> ---
>> Robin East
>> *Spark GraphX in Action* Michael Malak and Robin East
>> Manning Publications Co.
>> http://www.manning.com/books/spark-graphx-in-action
>>
>>
>>
>>
>>
>> On 1 Nov 2016, at 10:15, Nirav Patel  wrote:
>>
>> FYI, I do reuse IDF model while making prediction against new unlabeled
>> data but not between training and test data while training a model.
>>
>> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel 
>> wrote:
>>
>>> I am using IDF estimator/model (TF-IDF) to convert text features into
>>> vectors. Currently, I fit IDF model on all sample data and then transform
>>> them. I read somewhere that I should split my data into training and test
>>> before fitting IDF model; Fit IDF only on training data and then use same
>>> transformer to transform training and test data.
>>> This raise more questions:
>>> 1) Why would you do that? What exactly do IDF learn during fitting
>>> process that it can reuse to transform any new dataset. Perhaps idea is to
>>> keep same value for |D| and DF|t, D| while use new TF|t, D| ?
>>> 2) If not then fitting and transforming seems redundant for IDF model
>>>
>>
>>
>>
>>
>>
>>
>>
>
>
>
>
>
>


Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Robin East
Fit it on training data to evaluate the model. You can either use that model to 
apply to unseen data or you can re-train (fit) on all your data before applying 
it to unseen data.

fit and transform are 2 different things: fit creates a model, transform 
applies a model to data to create transformed output. If you are using your 
training data in a subsequent step (e.g. running logistic regression or some 
other machine learning algorithm) then you need to transform your training data 
using the IDF model before passing it through the next step.

---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action 






> On 1 Nov 2016, at 11:18, Nirav Patel  wrote:
> 
> Just to re-iterate what you said, I should fit IDF model only on training 
> data and then re-use it for both test data and then later on unseen data to 
> make predictions.
> 
> On Tue, Nov 1, 2016 at 3:49 AM, Robin East  > wrote:
> The point of setting aside a portion of your data as a test set is to try and 
> mimic applying your model to unseen data. If you fit your IDF model to all 
> your data, any evaluation you perform on your test set is likely to over 
> perform compared to ‘real’ unseen data. Effectively you would have overfit 
> your model.
> ---
> Robin East
> Spark GraphX in Action Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action 
> 
> 
> 
> 
> 
> 
>> On 1 Nov 2016, at 10:15, Nirav Patel > > wrote:
>> 
>> FYI, I do reuse IDF model while making prediction against new unlabeled data 
>> but not between training and test data while training a model. 
>> 
>> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel > > wrote:
>> I am using IDF estimator/model (TF-IDF) to convert text features into 
>> vectors. Currently, I fit IDF model on all sample data and then transform 
>> them. I read somewhere that I should split my data into training and test 
>> before fitting IDF model; Fit IDF only on training data and then use same 
>> transformer to transform training and test data. 
>> This raise more questions:
>> 1) Why would you do that? What exactly do IDF learn during fitting process 
>> that it can reuse to transform any new dataset. Perhaps idea is to keep same 
>> value for |D| and DF|t, D| while use new TF|t, D| ?
>> 2) If not then fitting and transforming seems redundant for IDF model
>> 
>> 
>> 
>> 
>>  
>> 
>>     
>>    
>>       
>> 
> 
> 
> 
> 
>  
> 
>     
>    
>       
> 


Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
Just to re-iterate what you said, I should fit IDF model only on training
data and then re-use it for both test data and then later on unseen data to
make predictions.

On Tue, Nov 1, 2016 at 3:49 AM, Robin East  wrote:

> The point of setting aside a portion of your data as a test set is to try
> and mimic applying your model to unseen data. If you fit your IDF model to
> all your data, any evaluation you perform on your test set is likely to
> over perform compared to ‘real’ unseen data. Effectively you would have
> overfit your model.
> 
> ---
> Robin East
> *Spark GraphX in Action* Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
>
>
>
>
>
> On 1 Nov 2016, at 10:15, Nirav Patel  wrote:
>
> FYI, I do reuse IDF model while making prediction against new unlabeled
> data but not between training and test data while training a model.
>
> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel  wrote:
>
>> I am using IDF estimator/model (TF-IDF) to convert text features into
>> vectors. Currently, I fit IDF model on all sample data and then transform
>> them. I read somewhere that I should split my data into training and test
>> before fitting IDF model; Fit IDF only on training data and then use same
>> transformer to transform training and test data.
>> This raise more questions:
>> 1) Why would you do that? What exactly do IDF learn during fitting
>> process that it can reuse to transform any new dataset. Perhaps idea is to
>> keep same value for |D| and DF|t, D| while use new TF|t, D| ?
>> 2) If not then fitting and transforming seems redundant for IDF model
>>
>
>
>
>
>
>
>




Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Mich Talebzadeh
It would be great if we could establish this.

I know that in Hive these temporary tables ("CREATE TEMPORARY TABLE ...") are
private to the session and are put in a hidden staging directory, as below:

/user/hive/warehouse/.hive-staging_hive_2016-07-10_22-58-47_319_5605745346163312826-10

and removed when the session ends or the table is dropped.

Not sure how Spark handles this.

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 November 2016 at 10:50, Michael David Pedersen <
michael.d.peder...@googlemail.com> wrote:

> Thanks for the link, I hadn't come across this.
>
> According to https://forums.databricks.com/questions/400/what-is-the-diff
>> erence-between-registertemptable-a.html
>>
>> and I quote
>>
>> "registerTempTable()
>>
>> registerTempTable() creates an in-memory table that is scoped to the
>> cluster in which it was created. The data is stored using Hive's
>> highly-optimized, in-memory columnar format."
>>
> But then the last post in the thread corrects this, saying:
> "registerTempTable does not create a 'cached' in-memory table, but rather
> an alias or a reference to the DataFrame. It's akin to a pointer in C/C++
> or a reference in Java".
>
> So - probably need to dig into the sources to get more clarity on this.
>
> Cheers,
> Michael
>


Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Michael David Pedersen
Thanks for the link, I hadn't come across this.

According to https://forums.databricks.com/questions/400/what-is-the-
> difference-between-registertemptable-a.html
>
> and I quote
>
> "registerTempTable()
>
> registerTempTable() creates an in-memory table that is scoped to the
> cluster in which it was created. The data is stored using Hive's
> highly-optimized, in-memory columnar format."
>
But then the last post in the thread corrects this, saying:
"registerTempTable does not create a 'cached' in-memory table, but rather
an alias or a reference to the DataFrame. It's akin to a pointer in C/C++
or a reference in Java".

So - probably need to dig into the sources to get more clarity on this.
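
A quick sketch of the distinction, as far as I understand it (assuming a
spark-shell style session with a DataFrame `df` and a `sqlContext` in scope;
the table name is made up):

df.registerTempTable("events")         // only registers a name (an alias) for the DataFrame
sqlContext.cacheTable("events")        // marks the table for in-memory caching
sqlContext.sql("SELECT count(*) FROM events").show()  // first action materialises the cache (caching is lazy)
sqlContext.uncacheTable("events")      // release the memory when done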

Cheers,
Michael


Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Robin East
The point of setting aside a portion of your data as a test set is to try and 
mimic applying your model to unseen data. If you fit your IDF model to all your 
data, any evaluation you perform on your test set is likely to over perform 
compared to ‘real’ unseen data. Effectively you would have overfit your model.
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action 






> On 1 Nov 2016, at 10:15, Nirav Patel  wrote:
> 
> FYI, I do reuse IDF model while making prediction against new unlabeled data 
> but not between training and test data while training a model. 
> 
> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel  > wrote:
> I am using IDF estimator/model (TF-IDF) to convert text features into 
> vectors. Currently, I fit IDF model on all sample data and then transform 
> them. I read somewhere that I should split my data into training and test 
> before fitting IDF model; Fit IDF only on training data and then use same 
> transformer to transform training and test data. 
> This raise more questions:
> 1) Why would you do that? What exactly do IDF learn during fitting process 
> that it can reuse to transform any new dataset. Perhaps idea is to keep same 
> value for |D| and DF|t, D| while use new TF|t, D| ?
> 2) If not then fitting and transforming seems redundant for IDF model
> 
> 
> 
> 
>  
> 
>     
>    
>       
> 


Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Mich Talebzadeh
A bit of a gray area here, I am afraid; I was trying to experiment with it.

According to
https://forums.databricks.com/questions/400/what-is-the-difference-between-registertemptable-a.html

and I quote

"registerTempTable()

registerTempTable() creates an in-memory table that is scoped to the
cluster in which it was created. The data is stored using Hive's
highly-optimized, in-memory columnar format."


So on the face of it tempTable is an in-memory table

HTH




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 November 2016 at 10:01, Michael David Pedersen <
michael.d.peder...@googlemail.com> wrote:

> Hi again Mich,
>
> "But the thing is that I don't explicitly cache the tempTables ..".
>>
>> I believe tempTable is created in-memory and is already cached
>>
>
> That surprises me since there is a sqlContext.cacheTable method to
> explicitly cache a table in memory. Or am I missing something? This could
> explain why I'm seeing somewhat worse performance than I'd expect.
>
> Cheers,
> Michael
>


Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
FYI, I do reuse IDF model while making prediction against new unlabeled
data but not between training and test data while training a model.

On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel  wrote:

> I am using IDF estimator/model (TF-IDF) to convert text features into
> vectors. Currently, I fit IDF model on all sample data and then transform
> them. I read somewhere that I should split my data into training and test
> before fitting IDF model; Fit IDF only on training data and then use same
> transformer to transform training and test data.
> This raise more questions:
> 1) Why would you do that? What exactly do IDF learn during fitting process
> that it can reuse to transform any new dataset. Perhaps idea is to keep
> same value for |D| and DF|t, D| while use new TF|t, D| ?
> 2) If not then fitting and transforming seems redundant for IDF model
>




Is IDF model reusable

2016-11-01 Thread Nirav Patel
I am using IDF estimator/model (TF-IDF) to convert text features into
vectors. Currently, I fit IDF model on all sample data and then transform
them. I read somewhere that I should split my data into training and test
before fitting IDF model; Fit IDF only on training data and then use same
transformer to transform training and test data.
This raises more questions:
1) Why would you do that? What exactly does IDF learn during the fitting process
that it can reuse to transform any new dataset? Perhaps the idea is to keep the
same values for |D| and DF(t, D) while using the new TF(t, D)?
2) If not, then fitting and transforming seem redundant for the IDF model.




Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Michael David Pedersen
Hi again Mich,

"But the thing is that I don't explicitly cache the tempTables ..".
>
> I believe tempTable is created in-memory and is already cached
>

That surprises me since there is a sqlContext.cacheTable method to
explicitly cache a table in memory. Or am I missing something? This could
explain why I'm seeing somewhat worse performance than I'd expect.

Cheers,
Michael


Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread kant kodali
@Sean It looks like this problem can happen with other RDDs as well, not
just UnionRDD.

On Tue, Nov 1, 2016 at 2:52 AM, kant kodali  wrote:

> Hi Sean,
>
> The comments seem very relevant although I am not sure if this pull
> request https://github.com/apache/spark/pull/14985 would fix my issue? I
> am not sure what unionRDD.scala has anything to do with my error (I don't
> know much about spark code base). Do I ever use unionRDD.scala when I call
> mapToPair or ReduceByKey or forEachRDD?  This error is very easy to
> reproduce you actually don't need to ingest any data to spark streaming
> job. Just have one simple transformation consists of mapToPair, reduceByKey
> and forEachRDD and have the window interval of 1min and batch interval of
> one one second and simple call ssc.awaitTermination() and watch the
> Thread Count go up significantly.
>
> I do think that using a fixed size executor service would probably be a
> safer approach. One could leverage ForJoinPool if they think they could
> benefit a lot from the work-steal algorithm and doubly ended queues in the
> ForkJoinPool.
>
> Thanks!
>
>
>
>
> On Tue, Nov 1, 2016 at 2:19 AM, Sean Owen  wrote:
>
>> Possibly https://issues.apache.org/jira/browse/SPARK-17396 ?
>>
>> On Tue, Nov 1, 2016 at 2:11 AM kant kodali  wrote:
>>
>>> Hi Ryan,
>>>
>>> I think you are right. This may not be related to the Receiver. I have
>>> attached jstack dump here. I do a simple MapToPair and reduceByKey and  I
>>> have a window Interval of 1 minute (6ms) and batch interval of 1s (
>>> 1000) This is generating lot of threads atleast 5 to 8 threads per
>>> second and the total number of threads is monotonically increasing. So just
>>> for tweaking purpose I changed my window interval to 1min (6ms) and
>>> batch interval of 10s (1) this looked lot better but still not
>>> ideal at very least it is not monotonic anymore (It goes up and down). Now
>>> my question  really is how do I tune such that my number of threads are
>>> optimal while satisfying the window Interval of 1 minute (6ms) and
>>> batch interval of 1s (1000) ?
>>>
>>> This jstack dump is taken after running my spark driver program for 2
>>> mins and there are about 1000 threads.
>>>
>>> Thanks!
>>>
>>
>


Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread kant kodali
Hi Sean,

The comments seem very relevant, although I am not sure whether this pull request,
https://github.com/apache/spark/pull/14985, would fix my issue. I am not
sure what UnionRDD.scala has to do with my error (I don't know
much about the Spark code base). Do I ever use UnionRDD.scala when I call
mapToPair, reduceByKey or forEachRDD? This error is very easy to
reproduce; you actually don't need to ingest any data into the Spark streaming
job. Just have one simple transformation consisting of mapToPair, reduceByKey
and forEachRDD, use a window interval of 1 min and a batch interval of
one second, simply call ssc.awaitTermination(), and watch the thread
count go up significantly.
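
For reference, a minimal Scala sketch of the shape of job being described here
(the socket source and word-count logic are placeholders rather than the actual
job):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("window-sketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1))          // batch interval = 1 s

val lines  = ssc.socketTextStream("localhost", 9999)       // placeholder source
val counts = lines
  .map(word => (word, 1L))                                 // mapToPair equivalent
  .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(1))    // 60 s window, sliding every batch

counts.foreachRDD { rdd => rdd.take(10).foreach(println) }

ssc.start()
ssc.awaitTermination()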

I do think that using a fixed-size executor service would probably be a
safer approach. One could leverage ForkJoinPool if they think they could
benefit a lot from the work-stealing algorithm and double-ended queues in the
ForkJoinPool.

Thanks!




On Tue, Nov 1, 2016 at 2:19 AM, Sean Owen  wrote:

> Possibly https://issues.apache.org/jira/browse/SPARK-17396 ?
>
> On Tue, Nov 1, 2016 at 2:11 AM kant kodali  wrote:
>
>> Hi Ryan,
>>
>> I think you are right. This may not be related to the Receiver. I have
>> attached jstack dump here. I do a simple MapToPair and reduceByKey and  I
>> have a window Interval of 1 minute (6ms) and batch interval of 1s (
>> 1000) This is generating lot of threads atleast 5 to 8 threads per
>> second and the total number of threads is monotonically increasing. So just
>> for tweaking purpose I changed my window interval to 1min (6ms) and
>> batch interval of 10s (1) this looked lot better but still not ideal
>> at very least it is not monotonic anymore (It goes up and down). Now my
>> question  really is how do I tune such that my number of threads are
>> optimal while satisfying the window Interval of 1 minute (6ms) and
>> batch interval of 1s (1000) ?
>>
>> This jstack dump is taken after running my spark driver program for 2
>> mins and there are about 1000 threads.
>>
>> Thanks!
>>
>


Re: Python - Spark Cassandra Connector on DC/OS

2016-11-01 Thread Andrew Holway
Sorry: Spark 2.0.0

On Tue, Nov 1, 2016 at 10:04 AM, Andrew Holway <
andrew.hol...@otternetworks.de> wrote:

> Hello,
>
> I've been getting pretty serious with DC/OS which I guess could be
> described as a somewhat polished distribution of Mesos. I'm not sure how
> relevant DC/OS is to this problem.
>
> I am using this pyspark program to test the cassandra connection:
> http://bit.ly/2eWAfxm (github)
>
> I can see that the df.printSchema() method is working OK but the df.show()
> method is breaking with this error:
>
> Traceback (most recent call last):
>   File "/mnt/mesos/sandbox/squeeze.py", line 28, in 
> df.show()
>   File "/opt/spark/dist/python/lib/pyspark.zip/pyspark/sql/dataframe.py",
> line 287, in show
>   File "/opt/spark/dist/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
> line 933, in __call__
>   File "/opt/spark/dist/python/lib/pyspark.zip/pyspark/sql/utils.py",
> line 63, in deco
>   File "/opt/spark/dist/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
> line 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling
> o33.showString.
>
> Full output to stdout and stderr. : http://bit.ly/2f80f9e (gist)
>
> Versions:
>
> Spark 2.0.1
> Python Version: 3.4.3 (default, Sep 14 2016, 12:36:27)
> [cqlsh 5.0.1 | Cassandra 2.2.8 | CQL spec 3.3.1 | Native protocol v4]
> DC/OS v.1.8.4
>
> Cheers,
>
> Andrew
>
> --
> Otter Networks UG
> http://otternetworks.de
> Gotenstraße 17
> 10829 Berlin
>



-- 
Otter Networks UG
http://otternetworks.de
Gotenstraße 17
10829 Berlin


Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread Sean Owen
Possibly https://issues.apache.org/jira/browse/SPARK-17396 ?

On Tue, Nov 1, 2016 at 2:11 AM kant kodali  wrote:

> Hi Ryan,
>
> I think you are right. This may not be related to the Receiver. I have
> attached jstack dump here. I do a simple MapToPair and reduceByKey and  I
> have a window Interval of 1 minute (6ms) and batch interval of 1s (
> 1000) This is generating lot of threads atleast 5 to 8 threads per second
> and the total number of threads is monotonically increasing. So just for
> tweaking purpose I changed my window interval to 1min (6ms) and batch
> interval of 10s (1) this looked lot better but still not ideal at
> very least it is not monotonic anymore (It goes up and down). Now my
> question  really is how do I tune such that my number of threads are
> optimal while satisfying the window Interval of 1 minute (6ms) and
> batch interval of 1s (1000) ?
>
> This jstack dump is taken after running my spark driver program for 2 mins
> and there are about 1000 threads.
>
> Thanks!
>


Python - Spark Cassandra Connector on DC/OS

2016-11-01 Thread Andrew Holway
Hello,

I've been getting pretty serious with DC/OS which I guess could be
described as a somewhat polished distribution of Mesos. I'm not sure how
relevant DC/OS is to this problem.

I am using this pyspark program to test the cassandra connection:
http://bit.ly/2eWAfxm (github)

I can see that the df.printSchema() method is working OK but the df.show()
method is breaking with this error:

Traceback (most recent call last):
  File "/mnt/mesos/sandbox/squeeze.py", line 28, in 
df.show()
  File "/opt/spark/dist/python/lib/pyspark.zip/pyspark/sql/dataframe.py",
line 287, in show
  File
"/opt/spark/dist/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line
933, in __call__
  File "/opt/spark/dist/python/lib/pyspark.zip/pyspark/sql/utils.py", line
63, in deco
  File "/opt/spark/dist/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o33.showString.

Full output to stdout and stderr. : http://bit.ly/2f80f9e (gist)

Versions:

Spark 2.0.1
Python Version: 3.4.3 (default, Sep 14 2016, 12:36:27)
[cqlsh 5.0.1 | Cassandra 2.2.8 | CQL spec 3.3.1 | Native protocol v4]
DC/OS v.1.8.4

Cheers,

Andrew

-- 
Otter Networks UG
http://otternetworks.de
Gotenstraße 17
10829 Berlin


Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread kant kodali
This question looks very similar to mine but I don't see any answer.

http://markmail.org/message/kkxhi5jjtwyadzxt

On Mon, Oct 31, 2016 at 11:24 PM, kant kodali  wrote:

> Here is a UI of my thread dump.
>
> http://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTYv
> MTEvMS8tLWpzdGFja19kdW1wX3dpbmRvd19pbnRlcnZhbF8xbWluX2JhdGNo
> X2ludGVydmFsXzFzLnR4dC0tNi0xNy00Ng==
>
> On Mon, Oct 31, 2016 at 7:10 PM, kant kodali  wrote:
>
>> Hi Ryan,
>>
>> I think you are right. This may not be related to the Receiver. I have
>> attached jstack dump here. I do a simple MapToPair and reduceByKey and  I
>> have a window Interval of 1 minute (6ms) and batch interval of 1s (
>> 1000) This is generating lot of threads atleast 5 to 8 threads per
>> second and the total number of threads is monotonically increasing. So just
>> for tweaking purpose I changed my window interval to 1min (6ms) and
>> batch interval of 10s (1) this looked lot better but still not ideal
>> at very least it is not monotonic anymore (It goes up and down). Now my
>> question  really is how do I tune such that my number of threads are
>> optimal while satisfying the window Interval of 1 minute (6ms) and
>> batch interval of 1s (1000) ?
>>
>> This jstack dump is taken after running my spark driver program for 2
>> mins and there are about 1000 threads.
>>
>> Thanks!
>>
>>
>> On Mon, Oct 31, 2016 at 1:09 PM, Shixiong(Ryan) Zhu <
>> shixi...@databricks.com> wrote:
>>
>>> If there is some leaking threads, I think you should be able to see the
>>> number of threads is increasing. You can just dump threads after 1-2 hours.
>>>
>>> On Mon, Oct 31, 2016 at 12:59 PM, kant kodali 
>>> wrote:
>>>
 yes I can certainly use jstack but it requires 4 to 5 hours for me to
 reproduce the error so I can get back as early as possible.

 Thanks a lot!

 On Mon, Oct 31, 2016 at 12:41 PM, Shixiong(Ryan) Zhu <
 shixi...@databricks.com> wrote:

> Then it should not be a Receiver issue. Could you use `jstack` to find
> out the name of leaking threads?
>
> On Mon, Oct 31, 2016 at 12:35 PM, kant kodali 
> wrote:
>
>> Hi Ryan,
>>
>> It happens on the driver side and I am running on a client mode (not
>> the cluster mode).
>>
>> Thanks!
>>
>> On Mon, Oct 31, 2016 at 12:32 PM, Shixiong(Ryan) Zhu <
>> shixi...@databricks.com> wrote:
>>
>>> Sorry, there is a typo in my previous email: this may **not** be
>>> the root cause if the leak threads are in the driver side.
>>>
>>> Does it happen in the driver or executors?
>>>
>>> On Mon, Oct 31, 2016 at 12:20 PM, kant kodali 
>>> wrote:
>>>
 Hi Ryan,

 Ahh My Receiver.onStop method is currently empty.

 1) I have a hard time seeing why the receiver would crash so many 
 times within a span of 4 to 5 hours but anyways I understand I should 
 still cleanup during OnStop.

 2) How do I clean up those threads? The documentation here 
 https://docs.oracle.com/javase/8/docs/api/java/lang/Thread.html 
 doesn't seem to have any method where I can clean up the threads 
 created during OnStart. any ideas?

 Thanks!


 On Mon, Oct 31, 2016 at 11:58 AM, Shixiong(Ryan) Zhu <
 shixi...@databricks.com> wrote:

> So in your code, each Receiver will start a new thread. Did you
> stop the receiver properly in `Receiver.onStop`? Otherwise, you may 
> leak
> threads after a receiver crashes and is restarted by Spark. However, 
> this
> may be the root cause since the leak threads are in the driver side. 
> Could
> you use `jstack` to check which types of threads are leaking?
>
> On Mon, Oct 31, 2016 at 11:50 AM, kant kodali 
> wrote:
>
>> I am also under the assumption that *onStart *function of the
>> Receiver is only called only once by Spark. please correct me if
>> I am wrong.
>>
>> On Mon, Oct 31, 2016 at 11:35 AM, kant kodali > > wrote:
>>
>>> My driver program runs a spark streaming job.  And it spawns a
>>> thread by itself only in the *onStart()* function below Other
>>> than that it doesn't spawn any other threads. It only calls 
>>> MapToPair,
>>> ReduceByKey, forEachRDD, Collect functions.
>>>
>>> public class NSQReceiver extends Receiver {
>>>
>>> private String topic="";
>>>
>>> public NSQReceiver(String topic) {
>>> super(StorageLevel.MEMORY_AND_DISK_2());
>>> this.topic = topic;
>>> }
>>>
>>> @Override
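
The quoted Java snippet above is cut off in the archive; for reference, here is
an illustrative Scala sketch of the onStart/onStop pattern being discussed:
keep a handle to the thread started in onStart so that onStop can shut it down
and a restarted receiver does not leak the old thread. The class name, topic
handling and message loop are placeholders, not the actual NSQ receiver code.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SketchReceiver(topic: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  @volatile private var worker: Thread = _

  override def onStart(): Unit = {
    worker = new Thread("sketch-receiver-" + topic) {
      override def run(): Unit = {
        try {
          while (!isStopped()) {            // isStopped() turns true once onStop has run
            store("message from " + topic)  // placeholder for real message consumption
            Thread.sleep(1000)
          }
        } catch {
          case _: InterruptedException =>   // interrupted by onStop: just exit the loop
        }
      }
    }
    worker.setDaemon(true)
    worker.start()
  }

  override def onStop(): Unit = {
    // Interrupt and join so that a receiver restart does not leave the old thread behind.
    val t = worker
    if (t != null) {
      t.interrupt()
      t.join(5000)
      worker = null
    }
  }
}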

Addition of two SparseVector

2016-11-01 Thread Yan Facai
Hi, all.
How can I add a Vector to another one?


scala> val a = Vectors.sparse(20, Seq((1,1.0), (2,2.0)))
a: org.apache.spark.ml.linalg.Vector = (20,[1,2],[1.0,2.0])

scala> val b = Vectors.sparse(20, Seq((2,2.0), (3,3.0)))
b: org.apache.spark.ml.linalg.Vector = (20,[2,3],[2.0,3.0])

scala> a + b
:38: error: type mismatch;
 found   : org.apache.spark.ml.linalg.Vector
 required: String
   a + b
   ^
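
There is no "+" defined on ml Vectors, so the compiler falls back to String
concatenation, which is where the "required: String" message comes from. One
possible workaround is a small helper that adds element-wise and rebuilds a
sparse vector (addVectors is a made-up name, not a Spark API):

import org.apache.spark.ml.linalg.{Vector, Vectors}

def addVectors(a: Vector, b: Vector): Vector = {
  require(a.size == b.size, "vectors must have the same dimension")
  val sum = a.toArray.zip(b.toArray).map { case (x, y) => x + y }
  // keep only the non-zero entries so the result stays sparse
  val nonZero = sum.zipWithIndex.collect { case (v, i) if v != 0.0 => (i, v) }
  Vectors.sparse(a.size, nonZero)
}

val a = Vectors.sparse(20, Seq((1, 1.0), (2, 2.0)))
val b = Vectors.sparse(20, Seq((2, 2.0), (3, 3.0)))
addVectors(a, b)   // (20,[1,2,3],[1.0,4.0,3.0])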


Spark Job Failed with FileNotFoundException

2016-11-01 Thread fanooos
I have a Spark cluster consisting of 5 nodes and a Spark job that should
process some files from a directory and send their content to Kafka.

I am trying to submit the job using the following command

bin$ ./spark-submit --total-executor-cores 20 --executor-memory 5G --class
org.css.java.FileMigration.FileSparkMigrator --master
spark://spark-master:7077
/home/me/FileMigrator-0.1.1-jar-with-dependencies.jar /home/me/shared
kafka01,kafka02,kafka03,kafka04,kafka05 kafka_topic


The directory /home/me/shared is mounted on all the 5 nodes but when I
submit the job I got the following exception

java.io.FileNotFoundException: File file:/home/me/shared/input_1.txt does
not exist
at
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:140)
at
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
at
org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:108)
at
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)



After some tries, I noticed another weird behavior. When I submit the job
while the directory contains 1 or 2 files, the same exception is thrown on
the driver machine, but the job continues and the files are processed
successfully. Once I add another file, the exception is thrown and the job
fails.
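
A minimal sketch of the distinction that usually causes this (the app name, paths, and glob below are illustrative, not taken from the job): with a file:// URI every executor opens that exact path on its own local filesystem, so the mount must exist and be readable on every worker; otherwise the input has to be staged on shared storage such as HDFS first.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("FileSparkMigrator"))

// file:// input: each executor reads this path from its own local filesystem,
// so the mount must be present and readable on every worker node.
val localLines = sc.textFile("file:///home/me/shared/*.txt")

// Alternative: copy the files to a distributed filesystem first, so every
// executor resolves the same path.
val hdfsLines = sc.textFile("hdfs:///user/me/shared/*.txt")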





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Job-Failed-with-FileNotFoundException-tp27980.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread kant kodali
Here is a UI of my thread dump.

http://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTYvMTEvMS8tLWpzdG
Fja19kdW1wX3dpbmRvd19pbnRlcnZhbF8xbWluX2JhdGNoX2ludGVydmFsXz
FzLnR4dC0tNi0xNy00Ng==

On Mon, Oct 31, 2016 at 7:10 PM, kant kodali  wrote:

> Hi Ryan,
>
> I think you are right. This may not be related to the Receiver. I have
> attached the jstack dump here. I do a simple MapToPair and reduceByKey, and I
> have a window interval of 1 minute (60000 ms) and a batch interval of 1 s (1000 ms).
> This is generating a lot of threads, at least 5 to 8 per second,
> and the total number of threads is monotonically increasing. So just for
> tweaking purposes I changed the window interval to 1 min (60000 ms) and the batch
> interval to 10 s (10000 ms); this looked a lot better but is still not ideal. At the
> very least it is not monotonic anymore (it goes up and down). Now my
> question really is: how do I tune this so that the number of threads stays
> optimal while satisfying a window interval of 1 minute (60000 ms) and a
> batch interval of 1 s (1000 ms)?
>
> This jstack dump was taken after running my Spark driver program for 2 minutes,
> and there are about 1000 threads.
>
> Thanks!
>
>
> On Mon, Oct 31, 2016 at 1:09 PM, Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> If there are some leaking threads, I think you should be able to see that the
>> number of threads is increasing. You can just dump threads after 1-2 hours.
>>
>> On Mon, Oct 31, 2016 at 12:59 PM, kant kodali  wrote:
>>
>>> Yes, I can certainly use jstack, but it takes 4 to 5 hours for me to
>>> reproduce the error, so I will get back to you as early as possible.
>>>
>>> Thanks a lot!
>>>
>>> On Mon, Oct 31, 2016 at 12:41 PM, Shixiong(Ryan) Zhu <
>>> shixi...@databricks.com> wrote:
>>>
 Then it should not be a Receiver issue. Could you use `jstack` to find
 out the name of leaking threads?

 On Mon, Oct 31, 2016 at 12:35 PM, kant kodali 
 wrote:

> Hi Ryan,
>
> It happens on the driver side and I am running on a client mode (not
> the cluster mode).
>
> Thanks!
>
> On Mon, Oct 31, 2016 at 12:32 PM, Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Sorry, there is a typo in my previous email: this may **not** be the
>> root cause if the leak threads are in the driver side.
>>
>> Does it happen in the driver or executors?
>>
>> On Mon, Oct 31, 2016 at 12:20 PM, kant kodali 
>> wrote:
>>
>>> Hi Ryan,
>>>
>>> Ahh, my Receiver.onStop method is currently empty.
>>>
>>> 1) I have a hard time seeing why the receiver would crash so many times
>>> within a span of 4 to 5 hours, but anyway I understand I should still
>>> clean up during onStop.
>>>
>>> 2) How do I clean up those threads? The documentation here
>>> https://docs.oracle.com/javase/8/docs/api/java/lang/Thread.html doesn't
>>> seem to have any method for cleaning up the threads created during
>>> onStart. Any ideas? (One possible pattern is sketched after this thread.)
>>>
>>> Thanks!
>>>
>>>
>>> On Mon, Oct 31, 2016 at 11:58 AM, Shixiong(Ryan) Zhu <
>>> shixi...@databricks.com> wrote:
>>>
 So in your code, each Receiver will start a new thread. Did you
 stop the receiver properly in `Receiver.onStop`? Otherwise, you may leak
 threads after a receiver crashes and is restarted by Spark. However, this
 may be the root cause since the leak threads are in the driver side. Could
 you use `jstack` to check which types of threads are leaking?

 On Mon, Oct 31, 2016 at 11:50 AM, kant kodali 
 wrote:

> I am also under the assumption that the *onStart* function of the
> Receiver is only called once by Spark. Please correct me if
> I am wrong.
>
> On Mon, Oct 31, 2016 at 11:35 AM, kant kodali 
> wrote:
>
>> My driver program runs a Spark streaming job, and it spawns a
>> thread by itself only in the *onStart()* function below. Other
>> than that it doesn't spawn any other threads; it only calls
>> MapToPair, ReduceByKey, forEachRDD, and Collect functions.
>>
>> public class NSQReceiver extends Receiver {
>>
>> private String topic="";
>>
>> public NSQReceiver(String topic) {
>> super(StorageLevel.MEMORY_AND_DISK_2());
>> this.topic = topic;
>> }
>>
>> @Override
>> public void *onStart()* {
>> new Thread()  {
>> @Override public void run() {
>> receive();
>> }
>> }.start();
>> }
>>
>> }
>>
>>
>> Environment info:
>>

Re: java.lang.OutOfMemoryError: unable to create new native thread

2016-11-01 Thread kant kodali
Here is a UI of my thread dump.

http://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTYvMTEvMS8tLWpzdGFja19kdW1wX3dpbmRvd19pbnRlcnZhbF8xbWluX2JhdGNoX2ludGVydmFsXzFzLnR4dC0tNi0xNy00Ng==
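
Along the same lines as the /proc check suggested below, a small sketch (not from the thread; the object and parameter names are illustrative) that logs the driver JVM's live and peak thread counts on a schedule, so a leak shows up well before the OS-level limit is hit:

import java.lang.management.ManagementFactory
import java.util.concurrent.{Executors, TimeUnit}

object ThreadCountMonitor {
  // Log live and peak JVM thread counts every `intervalSeconds` seconds.
  def start(intervalSeconds: Long = 60L): Unit = {
    val mx = ManagementFactory.getThreadMXBean
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit =
        println(s"live JVM threads: ${mx.getThreadCount}, peak: ${mx.getPeakThreadCount}")
    }, 0L, intervalSeconds, TimeUnit.SECONDS)
  }
}

Calling ThreadCountMonitor.start() once from the driver's main method is enough; the printed count can then be compared against the `max user processes` (nproc) limit discussed below.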



On Mon, Oct 31, 2016 at 10:32 PM, kant kodali  wrote:

> Hi Vadim,
>
> Thank you so much, this was a very useful command. This conversation is
> going on here:
>
> https://www.mail-archive.com/user@spark.apache.org/msg58656.html
>
> or you can just google "
>
> why spark driver program is creating so many threads? How can I limit this
> number?
> 
> "
>
> Please take a look if you are interested.
>
> Thanks a lot!
>
> On Mon, Oct 31, 2016 at 8:14 AM, Vadim Semenov <
> vadim.seme...@datadoghq.com> wrote:
>
>> Have you tried to get the number of threads in a running process using `cat
>> /proc/<pid>/status`?
>>
>> On Sun, Oct 30, 2016 at 11:04 PM, kant kodali  wrote:
>>
>>> Yes, I did run ps -ef | grep "app_name" and it is root.
>>>
>>>
>>>
>>> On Sun, Oct 30, 2016 at 8:00 PM, Chan Chor Pang 
>>> wrote:
>>>
 Sorry, the UID.

 On 10/31/16 11:59 AM, Chan Chor Pang wrote:

 actually if the max user processes is not the problem, I have no idea

 but I am still suspecting the user,
 as the user who runs spark-submit is not necessarily the owner of the JVM
 process

 can you make sure that when you run "ps -ef | grep {your app id}" the PID is root?
 On 10/31/16 11:21 AM, kant kodali wrote:

 The Java process is run by root and it has the same config:

 sudo -i

 ulimit -a

 core file size  (blocks, -c) 0
 data seg size   (kbytes, -d) unlimited
 scheduling priority (-e) 0
 file size   (blocks, -f) unlimited
 pending signals (-i) 120242
 max locked memory   (kbytes, -l) 64
 max memory size (kbytes, -m) unlimited
 open files  (-n) 1024
 pipe size(512 bytes, -p) 8
 POSIX message queues (bytes, -q) 819200
 real-time priority  (-r) 0
 stack size  (kbytes, -s) 8192
 cpu time   (seconds, -t) unlimited
 max user processes  (-u) 120242
 virtual memory  (kbytes, -v) unlimited
 file locks  (-x) unlimited



 On Sun, Oct 30, 2016 at 7:01 PM, Chan Chor Pang  wrote:

> I had the same exception before, and the problem was fixed after I changed
> the nproc conf.
>
> > max user processes  (-u) 120242
> ↑this config does look good.
> Are you sure the user who runs ulimit -a is the same user who runs the
> Java process?
> Depending on how you submit the job and your settings, the Spark job may execute
> as another user.
>
>
> On 10/31/16 10:38 AM, kant kodali wrote:
>
> when I did this
>
> cat /proc/sys/kernel/pid_max
>
> I got 32768
>
> On Sun, Oct 30, 2016 at 6:36 PM, kant kodali 
> wrote:
>
>> I believe for Ubuntu it is unlimited, but I am not 100% sure (I just
>> read that somewhere online). I ran ulimit -a and this is what I get:
>>
>> core file size  (blocks, -c) 0
>> data seg size   (kbytes, -d) unlimited
>> scheduling priority (-e) 0
>> file size   (blocks, -f) unlimited
>> pending signals (-i) 120242
>> max locked memory   (kbytes, -l) 64
>> max memory size (kbytes, -m) unlimited
>> open files  (-n) 1024
>> pipe size(512 bytes, -p) 8
>> POSIX message queues (bytes, -q) 819200
>> real-time priority  (-r) 0
>> stack size  (kbytes, -s) 8192
>> cpu time   (seconds, -t) unlimited
>> max user processes  (-u) 120242
>> virtual memory  (kbytes, -v) unlimited
>> file locks  (-x) unlimited
>>
>> On Sun, Oct 30, 2016 at 6:15 PM, Chan Chor Pang <
>> chin...@indetail.co.jp> wrote:
>>
>>> not sure for Ubuntu, but I think you can just create the file
>>> yourself;
>>> the syntax will be the same as /etc/security/limits.conf
>>>
>>> nproc.conf limits not only the Java process but all processes by the same
>>> user,
>>> so even if the JVM process does nothing, if the corresponding user is
>>> busy in other ways,
>>> the JVM process will still not be able to create a new thread.
>>>
>>> btw the default limit for CentOS is 1024
>>>
>>>
>>> On 10/31/16 9:51 AM, kant kodali wrote:
>>>
>>>
>>> On Sun, Oct 30, 2016 at 5:22 PM, Chan Chor Pang <
>>>