Re: Assigning a unique row ID

2017-04-07 Thread Subhash Sriram
Hi,

We use monotonically_increasing_id() as well, but just cache the table first 
like Ankur suggested. With that method, we get the same keys in all derived 
tables. 
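
For reference, a minimal sketch of that approach (the DataFrame and column
names here are just illustrative):

import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

// Materialize the table once so the generated IDs are fixed before any
// derived tables re-trigger the computation.
val withId = df.withColumn("row_id", monotonically_increasing_id()).cache()
withId.count()  // force the cache to be populated

// Tables derived from withId now see the same row_id for the same rows.
val x = withId.filter(col("someColumn") > 0)
val y = withId.select("row_id", "someColumn")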

Thanks,
Subhash


> On Apr 7, 2017, at 7:32 PM, Everett Anderson  wrote:
> 
> Hi,
> 
> Thanks, but that's using a random UUID. Certainly unlikely to have 
> collisions, but not guaranteed.
> 
> I'd prefer something like monotonically_increasing_id or RDD's 
> zipWithUniqueId, but with better behavioral characteristics -- so they don't 
> surprise people when two or more outputs derived from an original table end 
> up not having the same IDs for the same rows.
> 
> It seems like this would be possible under the covers, but would have the 
> performance penalty of needing to do perhaps a count() and then also a 
> checkpoint.
> 
> I was hoping there's a better way.
> 
> 
>> On Fri, Apr 7, 2017 at 4:24 PM, Tim Smith  wrote:
>> http://stackoverflow.com/questions/37231616/add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator
>> 
>> 
>>> On Fri, Apr 7, 2017 at 3:56 PM, Everett Anderson  
>>> wrote:
>>> Hi,
>>> 
>>> What's the best way to assign a truly unique row ID (rather than a hash) to 
>>> a DataFrame/Dataset?
>>> 
>>> I originally thought that functions.monotonically_increasing_id would do 
>>> this, but it seems to have a rather unfortunate property that if you add it 
>>> as a column to table A and then derive tables X, Y, Z and save those, the 
>>> row ID values in X, Y, and Z may end up different. I assume this is because 
>>> it delays the actual computation to the point where each of those tables is 
>>> computed.
>>> 
>> 
>> 
>> 
>> -- 
>> 
>> --
>> Thanks,
>> 
>> Tim
> 


Re: Assigning a unique row ID

2017-04-07 Thread Everett Anderson
Hi,

Thanks, but that's using a random UUID. Certainly unlikely to have
collisions, but not guaranteed.

I'd prefer something like monotonically_increasing_id or RDD's
zipWithUniqueId, but with better behavioral characteristics -- so they don't
surprise people when two or more outputs derived from an original table end
up not having the same IDs for the same rows.

It seems like this would be possible under the covers, but would have the
performance penalty of needing to do perhaps a count() and then also a
checkpoint.

I was hoping there's a better way.


On Fri, Apr 7, 2017 at 4:24 PM, Tim Smith  wrote:

> http://stackoverflow.com/questions/37231616/add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator
>
>
> On Fri, Apr 7, 2017 at 3:56 PM, Everett Anderson  > wrote:
>
>> Hi,
>>
>> What's the best way to assign a truly unique row ID (rather than a hash)
>> to a DataFrame/Dataset?
>>
>> I originally thought that functions.monotonically_increasing_id would do
>> this, but it seems to have a rather unfortunate property that if you add it
>> as a column to table A and then derive tables X, Y, Z and save those, the
>> row ID values in X, Y, and Z may end up different. I assume this is because
>> it delays the actual computation to the point where each of those tables is
>> computed.
>>
>>
>
>
> --
>
> --
> Thanks,
>
> Tim
>


Re: Assigning a unique row ID

2017-04-07 Thread Ankur Srivastava
You can use zipWithIndex or the approach Tim suggested, or even the one you
are using, but I believe the issue is that table A is being materialized
every time you run the new transformations. Are you caching/persisting
table A? If you do that you should not see this behavior.

Thanks
Ankur

On Fri, Apr 7, 2017 at 4:24 PM, Tim Smith  wrote:

> http://stackoverflow.com/questions/37231616/add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator
>
>
> On Fri, Apr 7, 2017 at 3:56 PM, Everett Anderson  > wrote:
>
>> Hi,
>>
>> What's the best way to assign a truly unique row ID (rather than a hash)
>> to a DataFrame/Dataset?
>>
>> I originally thought that functions.monotonically_increasing_id would do
>> this, but it seems to have a rather unfortunate property that if you add it
>> as a column to table A and then derive tables X, Y, Z and save those, the
>> row ID values in X, Y, and Z may end up different. I assume this is because
>> it delays the actual computation to the point where each of those tables is
>> computed.
>>
>>
>
>
> --
>
> --
> Thanks,
>
> Tim
>


Re: Assigning a unique row ID

2017-04-07 Thread Tim Smith
http://stackoverflow.com/questions/37231616/add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator


On Fri, Apr 7, 2017 at 3:56 PM, Everett Anderson 
wrote:

> Hi,
>
> What's the best way to assign a truly unique row ID (rather than a hash)
> to a DataFrame/Dataset?
>
> I originally thought that functions.monotonically_increasing_id would do
> this, but it seems to have a rather unfortunate property that if you add it
> as a column to table A and then derive tables X, Y, Z and save those, the
> row ID values in X, Y, and Z may end up different. I assume this is because
> it delays the actual computation to the point where each of those tables is
> computed.
>
>


-- 

--
Thanks,

Tim


Assigning a unique row ID

2017-04-07 Thread Everett Anderson
Hi,

What's the best way to assign a truly unique row ID (rather than a hash) to
a DataFrame/Dataset?

I originally thought that functions.monotonically_increasing_id would do
this, but it seems to have a rather unfortunate property that if you add it
as a column to table A and then derive tables X, Y, Z and save those, the
row ID values in X, Y, and Z may end up different. I assume this is because
it delays the actual computation to the point where each of those tables is
computed.
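
One rough alternative that should give stable IDs is going through RDD's
zipWithIndex and rebuilding the DataFrame (sketch only; "spark" is the
SparkSession, and zipWithIndex triggers an extra job to compute partition
sizes):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Append a consecutive, unique index to every row of df.
val indexed = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)
}
val schema = StructType(df.schema.fields :+ StructField("row_id", LongType, nullable = false))
val dfWithId = spark.createDataFrame(indexed, schema)

// Persist (or save) before deriving other tables so the IDs stay fixed.
dfWithId.cache()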


BucketedRandomProjectionLSHModel algorithm details

2017-04-07 Thread vvinton
Hi There,

Using spark-mllib_2.11-2.1.0. I am facing an issue where
BucketedRandomProjectionLSHModel.approxNearestNeighbors always returns only
one result.

Dataset looks like:

+----+--------------------+-------------+------------------------+----------------------+
|  id|            features|kmeansCluster|predictionVectorFeatures|featuresInNewDimension|
+----+--------------------+-------------+------------------------+----------------------+
|1045|(16384,[196,11016...|            0|    (16384,[196],[0.2...|  [[0.0], [0.0], [0...|
|1041|(16384,[4110,1065...|            0|    (16384,[196],[0.2...|  [[0.0], [0.0], [-...|
+----+--------------------+-------------+------------------------+----------------------+
Execution code:

Dataset approximatedDS = (Dataset) ((BucketedRandomProjectionLSHModel) model)
    .approxNearestNeighbors(dataset, vectorToCalculateAgainst, numberOfResults,
        false, MLFlowConstants.THEMES_PREDICTION_COLUMNS.distance.name());
Where:

numberOfResults = 2
vectorToCalculateAgainst = first vector in predictionVectorFeatures column
approximatedDS looks as follows:

+----+--------------------+-------------+------------------------+----------------------+------------------+
|  id|            features|kmeansCluster|predictionVectorFeatures|featuresInNewDimension|          distance|
+----+--------------------+-------------+------------------------+----------------------+------------------+
|1061|(16384,[196,11016...|            1|    (16384,[196],[0.2...|  [[0.0], [0.0], [0...|0.8536603178950374|
+----+--------------------+-------------+------------------------+----------------------+------------------+
My suspicion is that in LSH.scala

  // Compute threshold to get exact k elements.
  // TODO: SPARK-18409: Use approxQuantile to get the threshold
  val modelDatasetSortedByHash =
modelDataset.sort(hashDistCol).limit(numNearestNeighbors)
  val thresholdDataset = modelDatasetSortedByHash.select(max(hashDistCol))
  val hashThreshold = thresholdDataset.take(1).head.getDouble(0)

  // Filter the dataset where the hash value is less than the threshold.
  modelDataset.filter(hashDistCol <= hashThreshold)
}
the last filter does the wrong filtering, but I may be wrong (I do not know Scala).

Can anyone help me understand how to make
BucketedRandomProjectionLSHModel.approxNearestNeighbors return multiple
"nearest" vectors?

Thanks,






Structured streaming and writing output to Cassandra

2017-04-07 Thread shyla deshpande
Is anyone using structured streaming and writing the results to a Cassandra
database in a production environment?

I do not think I have enough expertise to write a custom sink that can be
used in a production environment. Please help!
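
For what it's worth, the route I have been looking at is the ForeachWriter
sink; here is only a skeleton, with the actual Cassandra calls left as
placeholders (streamingDF stands for the streaming DataFrame, and making
this robust is exactly the part I am unsure about):

import org.apache.spark.sql.{ForeachWriter, Row}

class CassandraSink extends ForeachWriter[Row] {
  override def open(partitionId: Long, version: Long): Boolean = {
    // open a Cassandra session/connection here
    true
  }
  override def process(row: Row): Unit = {
    // execute an INSERT/UPDATE for this row here
  }
  override def close(errorOrNull: Throwable): Unit = {
    // close the session here
  }
}

val query = streamingDF.writeStream
  .foreach(new CassandraSink)
  .outputMode("append")
  .start()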


Re: Apache Drill vs Spark SQL

2017-04-07 Thread Pierce Lamb
Hi Kant,

If you are interested in using Spark alongside a database to serve real
time queries, there are many options. Almost every popular database has
built some sort of connector to Spark. I've listed a majority of them and
tried to delineate them in some way in this StackOverflow answer:

http://stackoverflow.com/a/39753976/3723346

As an employee of SnappyData ,
I'm biased toward its solution, in which Spark and the database are deeply
integrated and run on the same JVM. But there are many options depending on
your needs.

I'm not sure if the above link also answers your second question, but there
are two graph databases listed that connect to Spark as well.

Hope this helps,

Pierce

On Thu, Apr 6, 2017 at 10:34 PM, kant kodali  wrote:

> Hi All,
>
> I am very impressed with the work done on Spark SQL, however when I have to
> pick something to serve real time queries I am in a dilemma for the
> following reasons.
>
> 1. Even though Spark SQL has logical plans, physical plans, run time
> code generation and all that, it still doesn't look like the tool to serve
> real time queries like we normally do from a database. I tend to think this
> is because the queries have to go through job submission first. I don't want
> to call this overhead or anything, but this is what it seems to do. On the
> other hand, we have the data that we want to serve sitting in a database
> where we simply issue an SQL query and get the response back, so for this
> use case what would be an appropriate tool? I tend to think it's Drill but
> would like to hear if there are any interesting arguments.
>
> 2. I can see a case for Spark SQL such as queries that need to be
> expressed in an iterative fashion. For example, doing a graph traversal such
> as BFS or DFS, or even simple pre-order, in-order and post-order traversals
> on a BST. All this will be very hard to express in a declarative syntax
> like SQL. I also tend to think ad-hoc distributed joins (by ad-hoc I mean
> one is not certain about their query patterns) are better off
> expressed in map-reduce style than, say, SQL, unless one knows their query
> patterns well ahead of time such that the possibility of queries that require
> redistribution is low. I am also sure there are plenty of other cases
> where Spark SQL will excel, but I wanted to see what is a good choice to
> simply serve the data?
>
> Any suggestions are appreciated.
>
> Thanks!
>
>


Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-07 Thread Sam Elamin
Definitely agree with Gourav there. I wouldn't want Jenkins to run my
workflow. It seems to me that you would only be using Jenkins for its
scheduling capabilities.

Yes, you can run tests, but you wouldn't want it to run your orchestration
of jobs.

What happens if Jenkins goes down for any particular reason? How do you
have the conversation with your stakeholders that your pipeline is not
working and they don't have data because the build server is going through
an upgrade?

However, to be fair, I understand what you are saying, Steve. If someone is
in a place where they only have access to Jenkins and have to go through
hoops to set up or get access to new instances, then engineers will do what
they always do: find ways to game the system to get their work done.




On Fri, 7 Apr 2017 at 16:17, Gourav Sengupta 
wrote:

> Hi Steve,
>
> Why would you ever do that? You are suggesting the use of a CI tool as a
> workflow and orchestration engine.
>
> Regards,
> Gourav Sengupta
>
> On Fri, Apr 7, 2017 at 4:07 PM, Steve Loughran 
> wrote:
>
>> If you have Jenkins set up for some CI workflow, that can do scheduled
>> builds and tests. Works well if you can do some build test before even
>> submitting it to a remote cluster
>>
>> On 7 Apr 2017, at 10:15, Sam Elamin  wrote:
>>
>> Hi Shyla
>>
>> You have multiple options really some of which have been already listed
>> but let me try and clarify
>>
>> Assuming you have a spark application in a jar you have a variety of
>> options
>>
>> You have to have an existing spark cluster that is either running on EMR
>> or somewhere else.
>>
>> *Super simple / hacky*
>> Cron job on EC2 that calls a simple shell script that does a spark-submit
>> to a Spark cluster, OR creates or adds a step to an EMR cluster
>>
>> *More Elegant*
>> Airflow/Luigi/AWS Data Pipeline (which is just cron in the UI) that will
>> do the above step but adds scheduling and potential backfilling and error
>> handling (retries, alerts, etc.)
>>
>> AWS are coming out with Glue  soon that
>> does some Spark jobs but I do not think it's available worldwide just yet
>>
>> Hope I cleared things up
>>
>> Regards
>> Sam
>>
>>
>> On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>>> Hi Shyla,
>>>
>>> why would you want to schedule a spark job in EC2 instead of EMR?
>>>
>>> Regards,
>>> Gourav
>>>
>>> On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <
>>> deshpandesh...@gmail.com> wrote:
>>>
 I want to run a spark batch job maybe hourly on AWS EC2 .  What is the
 easiest way to do this. Thanks

>>>
>>>
>>
>>
>


Contributed to spark

2017-04-07 Thread Stephen Fletcher
I'd like to eventually contribute to Spark, but I'm noticing that since Spark 2
the query planner is heavily used throughout the Dataset code base. Are there
any sites I can go to that explain the technical details, more than just
from a high-level perspective?


Re: reducebykey

2017-04-07 Thread Ankur Srivastava
Hi Stephen,

If you use aggregate functions or reduceGroups on KeyValueGroupedDataset, it
behaves as reduceByKey on RDD.

Only if you use flatMapGroups and mapGroups does it behave as groupByKey on
RDD, and if you read the API documentation it warns against using them.
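
A small illustrative sketch of the reduceGroups route (the case class and the
aggregation are made up for the example; spark is the SparkSession):

import spark.implicits._

case class Sale(product: String, amount: Double)

val sales = Seq(Sale("a", 1.0), Sale("a", 2.0), Sale("b", 5.0)).toDS()

// Combines values per key, reduceByKey-style, instead of materializing whole
// groups the way flatMapGroups/mapGroups would.
val totals = sales
  .groupByKey(_.product)
  .reduceGroups((x, y) => Sale(x.product, x.amount + y.amount))

totals.show()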

Hope this helps.

Thanks
Ankur

On Fri, Apr 7, 2017 at 7:26 AM, Stephen Fletcher  wrote:

> Are there plans to add reduceByKey to DataFrames? Since switching over to
> Spark 2 I find myself increasingly dissatisfied with the idea of converting
> DataFrames to RDDs to do procedural programming on grouped data (both from
> an ease-of-programming stance and a performance stance). So I've been using
> the DataFrame's experimental groupByKey and flatMapGroups, which perform
> extremely well, I'm guessing because of the encoders, but the amount of
> data being transferred is a little excessive. Are there any plans to port
> reduceByKey (and additionally a reduceByKey left and right)?
>


Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-07 Thread Gourav Sengupta
Hi Steve,

Why would you ever do that? You are suggesting the use of a CI tool as a
workflow and orchestration engine.

Regards,
Gourav Sengupta

On Fri, Apr 7, 2017 at 4:07 PM, Steve Loughran 
wrote:

> If you have Jenkins set up for some CI workflow, that can do scheduled
> builds and tests. Works well if you can do some build test before even
> submitting it to a remote cluster
>
> On 7 Apr 2017, at 10:15, Sam Elamin  wrote:
>
> Hi Shyla
>
> You have multiple options really some of which have been already listed
> but let me try and clarify
>
> Assuming you have a spark application in a jar you have a variety of
> options
>
> You have to have an existing spark cluster that is either running on EMR
> or somewhere else.
>
> *Super simple / hacky*
> Cron job on EC2 that calls a simple shell script that does a spark-submit
> to a Spark cluster, OR creates or adds a step to an EMR cluster
>
> *More Elegant*
> Airflow/Luigi/AWS Data Pipeline (which is just cron in the UI) that will
> do the above step but adds scheduling and potential backfilling and error
> handling (retries, alerts, etc.)
>
> AWS are coming out with Glue  soon that
> does some Spark jobs but I do not think it's available worldwide just yet
>
> Hope I cleared things up
>
> Regards
> Sam
>
>
> On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta  > wrote:
>
>> Hi Shyla,
>>
>> why would you want to schedule a spark job in EC2 instead of EMR?
>>
>> Regards,
>> Gourav
>>
>> On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande > > wrote:
>>
>>> I want to run a spark batch job maybe hourly on AWS EC2 .  What is the
>>> easiest way to do this. Thanks
>>>
>>
>>
>
>


Re: Does Spark uses its own HDFS client?

2017-04-07 Thread Jörn Franke
Maybe using Ranger or Sentry would be the better choice to intercept those 
calls?

> On 7. Apr 2017, at 16:32, Alvaro Brandon  wrote:
> 
> I was going through SparkContext.textFile() and I was wondering at which 
> point Spark communicates with HDFS. Since when you download the Spark 
> binaries you also specify the Hadoop version you will use, I'm guessing it 
> has its own client that calls HDFS wherever you specify it in the 
> configuration files.
> 
> The goal is to instrument and log all the calls that Spark does to HDFS. 
> Which class or classes perform these operations?
> 
> 




Re: Does Spark uses its own HDFS client?

2017-04-07 Thread Steve Loughran

On 7 Apr 2017, at 15:32, Alvaro Brandon 
> wrote:

I was going through SparkContext.textFile() and I was wondering at which 
point Spark communicates with HDFS. Since when you download the Spark binaries 
you also specify the Hadoop version you will use, I'm guessing it has its own 
client that calls HDFS wherever you specify it in the configuration files.



it uses the hadoop-hdfs JAR in the spark-assembly JAR or the lib dir under 
SPARK_HOME. Nobody would ever want to write their own HDFS client, not if you 
look at the bit of the code related to Kerberos. You could use webhdfs://, 
though it's not done here.


The goal is to instrument and log all the calls that Spark does to HDFS. Which 
class or classes perform these operations?



org.apache.hadoop.hdfs.DistributedFileSystem

Take a look at HTrace here: 
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/Tracing.html






Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-07 Thread Steve Loughran
If you have Jenkins set up for some CI workflow, that can do scheduled builds 
and tests. Works well if you can do some build test before even submitting it 
to a remote cluster

On 7 Apr 2017, at 10:15, Sam Elamin 
> wrote:

Hi Shyla

You have multiple options really some of which have been already listed but let 
me try and clarify

Assuming you have a spark application in a jar you have a variety of options

You have to have an existing spark cluster that is either running on EMR or 
somewhere else.

Super simple / hacky
Cron job on EC2 that calls a simple shell script that does a spark-submit to a 
Spark cluster, OR creates or adds a step to an EMR cluster

More Elegant
Airflow/Luigi/AWS Data Pipeline (which is just cron in the UI) that will do 
the above step but adds scheduling and potential backfilling and error 
handling (retries, alerts, etc.)

AWS are coming out with Glue soon that does some 
Spark jobs but I do not think it's available worldwide just yet

Hope I cleared things up

Regards
Sam


On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta 
> wrote:
Hi Shyla,

why would you want to schedule a spark job in EC2 instead of EMR?

Regards,
Gourav

On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande 
> wrote:
I want to run a Spark batch job, maybe hourly, on AWS EC2. What is the easiest 
way to do this? Thanks





small job runs out of memory using wholeTextFiles

2017-04-07 Thread Paul Tremblay
As part of my processing, I have the following code:

rdd = sc.wholeTextFiles("s3://paulhtremblay/noaa_tmp/", 10)
rdd.count()

The S3 directory has about 8GB of data and 61,878 files. I am using Spark
2.1, running it on EMR with 15 m3.xlarge nodes.

The job fails with this error:

: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 35532 in stage 0.0 failed 4 times, most recent failure: Lost task
35532.3 in stage 0.0 (TID 35543,
ip-172-31-36-192.us-west-2.compute.internal, executor 6):
ExecutorLostFailure (executor 6 exited caused by one of the running
tasks) Reason: Container killed by YARN for exceeding memory limits.
7.4 GB of 5.5 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.


I have run it dozens of times, increasing the number of partitions and
reducing the size of my data set (the original is 60GB), but get the same
error each time.
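
For reference, the setting the error message suggests boosting can be raised
when the job is launched; a rough sketch (the value is just an example, and
it is more commonly passed as --conf spark.yarn.executor.memoryOverhead=2048
to spark-submit):

import org.apache.spark.{SparkConf, SparkContext}

// Extra off-heap memory (in MB) that YARN allows per executor container.
val conf = new SparkConf().set("spark.yarn.executor.memoryOverhead", "2048")
val sc = new SparkContext(conf)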

In contrast, if I run a simple:

rdd = sc.textFile("s3://paulhtremblay/noaa_tmp/")
rdd.count()

The job finishes in 15 minutes, even with just 3 nodes.

Thanks

-- 
Paul Henry Tremblay
Robert Half Technology


Does Spark uses its own HDFS client?

2017-04-07 Thread Alvaro Brandon
I was going through SparkContext.textFile() and I was wondering at which
point Spark communicates with HDFS. Since when you download the Spark
binaries you also specify the Hadoop version you will use, I'm guessing it
has its own client that calls HDFS wherever you specify it in the
configuration files.

The goal is to instrument and log all the calls that Spark does to HDFS.
Which class or classes perform these operations?


reducebykey

2017-04-07 Thread Stephen Fletcher
Are there plans to add reduceByKey to DataFrames? Since switching over to
Spark 2 I find myself increasingly dissatisfied with the idea of converting
DataFrames to RDDs to do procedural programming on grouped data (both from
an ease-of-programming stance and a performance stance). So I've been using
the DataFrame's experimental groupByKey and flatMapGroups, which perform
extremely well, I'm guessing because of the encoders, but the amount of
data being transferred is a little excessive. Are there any plans to port
reduceByKey (and additionally a reduceByKey left and right)?


Cant convert Dataset to case class with Option fields

2017-04-07 Thread Dirceu Semighini Filho
Hi Devs,
I have some case classes here, and their fields are all optional:
case class A(b: Option[B] = None, c: Option[C] = None, ...)

If I read some data into a Dataset and try to convert it to this case class
using the as method, it doesn't give me any answer, it simply freezes.
If I change the case class to

case class A(b: B, c: C)
it works nicely and returns the field values as null.

Are Option fields not supported by the as method, or is this an issue?
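
For reference, a minimal self-contained version of what I mean (types and the
input source are simplified):

import org.apache.spark.sql.SparkSession

case class B(x: Int)
case class C(y: String)
case class A(b: Option[B] = None, c: Option[C] = None)

val spark = SparkSession.builder().appName("repro").getOrCreate()
import spark.implicits._

// some source whose columns match A's fields
val df = spark.read.json("data.json")

// this is the call that appears to hang when the fields are Options
val ds = df.as[A]
ds.show()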

Kind Regards,
Dirceu


Re: reading snappy eventlog files from hdfs using spark

2017-04-07 Thread Jörn Franke
How do you read them?

> On 7. Apr 2017, at 12:11, Jacek Laskowski  wrote:
> 
> Hi, 
> 
> If your Spark app uses snappy in the code, define an appropriate library 
> dependency to have it on classpath. Don't rely on transitive dependencies. 
> 
> Jacek
> 
> On 7 Apr 2017 8:34 a.m., "satishl"  wrote:
> Hi, I am planning to process Spark app event logs with another Spark app.
> These event logs are saved with snappy compression (extension: .snappy).
> When I read the file in a new Spark app, I get a snappy library not found
> error. I am confused as to how Spark can write the event log in snappy format
> without an error, but reading fails with the above error.
> 
> Any help in unblocking myself to read snappy eventlog files from hdfs using
> spark?
> 
> 
> 
> 
> 


Re: Spark 2.1 ml library scalability

2017-04-07 Thread Nick Pentreath
It's true that CrossValidator is not parallel currently - see
https://issues.apache.org/jira/browse/SPARK-19357 and feel free to help
review.

On Fri, 7 Apr 2017 at 14:18 Aseem Bansal  wrote:

>
>- Limited the data to 100,000 records.
>- 6 categorical feature which go through imputation, string indexing,
>one hot encoding. The maximum classes for the feature is 100. As data is
>imputated it becomes dense.
>- 1 numerical feature.
>- Training Logistic Regression through CrossValidation with grid to
>optimize its regularization parameter over the values 0.0001, 0.001, 0.005,
>0.01, 0.05, 0.1
>- Using spark's launcher api to launch it on a yarn cluster in Amazon
>AWS.
>
> I was thinking that as CrossValidator is finding the best parameters it
> should be able to run them independently. That sounds like something which
> could be run in parallel.
>
>
> On Fri, Apr 7, 2017 at 5:20 PM, Nick Pentreath 
> wrote:
>
> What is the size of training data (number examples, number features)?
> Dense or sparse features? How many classes?
>
> What commands are you using to submit your job via spark-submit?
>
> On Fri, 7 Apr 2017 at 13:12 Aseem Bansal  wrote:
>
> When using spark ml's LogisticRegression, RandomForest, CrossValidator
> etc. do we need to give any consideration while coding in making it scale
> with more CPUs or does it scale automatically?
>
> I am reading some data from S3, using a pipeline to train a model. I am
> running the job on a spark cluster with 36 cores and 60GB RAM and I cannot
> see much usage. It is running but I was expecting spark to use all RAM
> available and make it faster. So that's why I was thinking whether we need
> to take something particular in consideration or wrong expectations?
>
>
>


Re: Spark 2.1 ml library scalability

2017-04-07 Thread Aseem Bansal
   - Limited the data to 100,000 records.
   - 6 categorical features which go through imputation, string indexing and
   one-hot encoding. The maximum number of classes for a feature is 100. As
   the data is imputed it becomes dense.
   - 1 numerical feature.
   - Training logistic regression through CrossValidator with a grid to
   optimize its regularization parameter over the values 0.0001, 0.001, 0.005,
   0.01, 0.05, 0.1.
   - Using Spark's launcher API to launch it on a YARN cluster in Amazon
   AWS.

I was thinking that as CrossValidator is finding the best parameters, it
should be able to run them independently. That sounds like something which
could be run in parallel.
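
For context, the relevant part of the setup looks roughly like this (the bare
LogisticRegression below stands in for the full pipeline, and the fold count
is arbitrary):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()

// The six regularization values being searched over.
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.0001, 0.001, 0.005, 0.01, 0.05, 0.1))
  .build()

// Each parameter combination is currently fitted sequentially per fold
// (see SPARK-19357 for the work on parallelizing this).
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)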


On Fri, Apr 7, 2017 at 5:20 PM, Nick Pentreath 
wrote:

> What is the size of training data (number examples, number features)?
> Dense or sparse features? How many classes?
>
> What commands are you using to submit your job via spark-submit?
>
> On Fri, 7 Apr 2017 at 13:12 Aseem Bansal  wrote:
>
>> When using spark ml's LogisticRegression, RandomForest, CrossValidator
>> etc. do we need to give any consideration while coding in making it scale
>> with more CPUs or does it scale automatically?
>>
>> I am reading some data from S3, using a pipeline to train a model. I am
>> running the job on a spark cluster with 36 cores and 60GB RAM and I cannot
>> see much usage. It is running but I was expecting spark to use all RAM
>> available and make it faster. So that's why I was thinking whether we need
>> to take something particular in consideration or wrong expectations?
>>
>


Re: Spark 2.1 ml library scalability

2017-04-07 Thread Nick Pentreath
What is the size of training data (number examples, number features)? Dense
or sparse features? How many classes?

What commands are you using to submit your job via spark-submit?

On Fri, 7 Apr 2017 at 13:12 Aseem Bansal  wrote:

> When using spark ml's LogisticRegression, RandomForest, CrossValidator
> etc. do we need to give any consideration while coding in making it scale
> with more CPUs or does it scale automatically?
>
> I am reading some data from S3, using a pipeline to train a model. I am
> running the job on a spark cluster with 36 cores and 60GB RAM and I cannot
> see much usage. It is running but I was expecting spark to use all RAM
> available and make it faster. So that's why I was thinking whether we need
> to take something particular in consideration or wrong expectations?
>


Re: distinct query getting stuck at ShuffleBlockFetcherIterator

2017-04-07 Thread Ramesh Krishnan
Hi Yash,
Thank you for the response.
Sorry, it was not at distinct but at a join stage.
It was a self join. There were no errors and the jobs were stuck at that
step for around 7 hours; the last message that came through was:

*ShuffleBlockFetcherIterator: Started 4 remote fetches*
Thanks,
Ramesh

On Fri, Apr 7, 2017 at 10:05 AM, Yash Sharma  wrote:

> Hi Ramesh,
> Could you share some logs please? pastebin ? dag view ?
> Did you check for GC pauses if any.
>
> On Thu, 6 Apr 2017 at 21:55 Ramesh Krishnan 
> wrote:
>
>> I have a use case of distinct on a dataframe. When i run the application
>> is getting stuck at  LINE *ShuffleBlockFetcherIterator: Started 4 remote
>> fetches *forever.
>>
>> Can someone help .
>>
>>
>> Thanks
>> Ramesh
>>
>


Spark 2.1 ml library scalability

2017-04-07 Thread Aseem Bansal
When using Spark ML's LogisticRegression, RandomForest, CrossValidator, etc.,
do we need to give any consideration while coding to make them scale with
more CPUs, or do they scale automatically?

I am reading some data from S3 and using a pipeline to train a model. I am
running the job on a Spark cluster with 36 cores and 60GB RAM and I cannot
see much usage. It is running, but I was expecting Spark to use all the RAM
available and make it faster. That's why I was wondering whether we need to
take something particular into consideration, or whether my expectations are
wrong.


Is checkpointing in Spark Streaming Synchronous or Asynchronous ?

2017-04-07 Thread kant kodali
Hi All,

Is checkpointing in Spark Streaming synchronous or asynchronous? In other
words, can Spark continue processing the stream while checkpointing?

Thanks!


Re: Returning DataFrame for text file

2017-04-07 Thread Jacek Laskowski
Hi,

What's the alternative? Dataset? You've got textFile then.

It's an older API from the ages when Dataset was merely experimental.
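
A quick sketch of the difference, for the record (spark is the SparkSession
and the path is just an example):

// Returns a DataFrame with a single "value" column.
val df = spark.read.text("hdfs:///some/path")

// Returns a Dataset[String] instead.
val ds = spark.read.textFile("hdfs:///some/path")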

Jacek

On 29 Mar 2017 8:58 p.m., "George Obama"  wrote:

> Hi,
>
> I saw that in the API, either R or Scala, we are returning a DataFrame for
> sparkSession.read.text()
>
> What's the rationale behind this?
>
> Regards,
> George
>


Re: reading snappy eventlog files from hdfs using spark

2017-04-07 Thread Jacek Laskowski
Hi,

If your Spark app uses snappy in the code, define an appropriate library
dependency to have it on classpath. Don't rely on transitive dependencies.
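
For example, something along these lines in build.sbt (the artifact and
version below are just what I would try first -- match them to your
distribution):

// Put snappy-java on the application classpath explicitly.
libraryDependencies += "org.xerial.snappy" % "snappy-java" % "1.1.2.6"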

Jacek

On 7 Apr 2017 8:34 a.m., "satishl"  wrote:

Hi, I am planning to process Spark app event logs with another Spark app.
These event logs are saved with snappy compression (extension: .snappy).
When I read the file in a new Spark app, I get a snappy library not found
error. I am confused as to how Spark can write the event log in snappy format
without an error, but reading fails with the above error.

Any help in unblocking myself to read snappy eventlog files from hdfs using
spark?





Re: Error while reading the CSV

2017-04-07 Thread Praneeth Gayam
Try the following

spark-shell --master yarn-client  --name nayan  /opt/packages/-data-
prepration/target/scala-2.10/-data-prepration-assembly-1.0.jar


On Thu, Apr 6, 2017 at 6:36 PM, nayan sharma 
wrote:

> Hi All,
> I am getting error while loading CSV file.
>
> val datacsv = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("timeline.csv")
> java.lang.NoSuchMethodError: org.apache.commons.csv.CSVFormat.withQuote(Ljava/lang/Character;)Lorg/apache/commons/csv/CSVFormat;
>
>
> I have added the dependencies in sbt file
>
> // Spark Additional Library - CSV Read as DF
> libraryDependencies += "com.databricks" %% "spark-csv" % "1.5.0"
>
> *and starting the spark-shell with command*
>
> spark-shell --master yarn-client  --jars /opt/packages/-data-
> prepration/target/scala-2.10/-data-prepration-assembly-1.0.jar --name
> nayan
>
>
>
> Thanks for any help!!
>
>
> Thanks,
> Nayan
>


Re: Hi

2017-04-07 Thread kant kodali
Oops, sorry. Please ignore this; wrong mailing list.


Hi

2017-04-07 Thread kant kodali
Hi All,

I read the docs, however I still have the following question: for stateful
stream processing, is HDFS mandatory? In some places I see it is
required and in other places I see that RocksDB can be used. I just want to
know if HDFS is mandatory for stateful stream processing.

Thanks!


Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-07 Thread Sam Elamin
Hi Shyla

You have multiple options really some of which have been already listed but
let me try and clarify

Assuming you have a spark application in a jar you have a variety of options

You have to have an existing spark cluster that is either running on EMR or
somewhere else.

*Super simple / hacky*
Cron job on EC2 that calls a simple shell script that does a spark-submit
to a Spark cluster, OR creates or adds a step to an EMR cluster

*More Elegant*
Airflow/Luigi/AWS Data Pipeline (which is just cron in the UI) that will
do the above step but adds scheduling and potential backfilling and error
handling (retries, alerts, etc.)

AWS are coming out with Glue  soon that does
some Spark jobs but I do not think it's available worldwide just yet

Hope I cleared things up

Regards
Sam


On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta 
wrote:

> Hi Shyla,
>
> why would you want to schedule a spark job in EC2 instead of EMR?
>
> Regards,
> Gourav
>
> On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande 
> wrote:
>
>> I want to run a spark batch job maybe hourly on AWS EC2 .  What is the
>> easiest way to do this. Thanks
>>
>
>


Re: Error while reading the CSV

2017-04-07 Thread Yash Sharma
Sorry buddy, didn't get your question quite right.
Just to test, I created a Scala class with spark-csv and it seemed to work.

Donno if that would help much, but here are the env details:
EMR 2.7.3
scalaVersion := "2.11.8"
Spark version 2.0.2



On Fri, 7 Apr 2017 at 17:51 nayan sharma  wrote:

> Hi Yash,
> I know this will work perfect but here I wanted  to read the csv using the
> assembly jar file.
>
> Thanks,
> Nayan
>
> On 07-Apr-2017, at 10:02 AM, Yash Sharma  wrote:
>
> Hi Nayan,
> I use the --packages with the spark shell and the spark submit. Could you
> please try that and let us know:
> Command:
>
> spark-submit --packages com.databricks:spark-csv_2.11:1.4.0
>
>
> On Fri, 7 Apr 2017 at 00:39 nayan sharma  wrote:
>
> spark version 1.6.2
> scala version 2.10.5
>
> On 06-Apr-2017, at 8:05 PM, Jörn Franke  wrote:
>
> And which version does your Spark cluster use?
>
> On 6. Apr 2017, at 16:11, nayan sharma  wrote:
>
> scalaVersion := “2.10.5"
>
>
>
>
>
> On 06-Apr-2017, at 7:35 PM, Jörn Franke  wrote:
>
> Maybe your Spark is based on scala 2.11, but you compile it for 2.10 or
> the other way around?
>
> On 6. Apr 2017, at 15:54, nayan sharma  wrote:
>
> In addition I am using spark version 1.6.2
> Is there any chance the error is coming because the Scala version or
> dependencies are not matching? I just guessed.
>
> Thanks,
> Nayan
>
>
>
> On 06-Apr-2017, at 7:16 PM, nayan sharma  wrote:
>
> Hi Jorn,
> Thanks for replying.
>
> jar -tf catalyst-data-prepration-assembly-1.0.jar | grep csv
>
> after doing this I have found a lot of classes under
> com/databricks/spark/csv/
>
> do I need to check for any specific class ??
>
> Regards,
> Nayan
>
> On 06-Apr-2017, at 6:42 PM, Jörn Franke  wrote:
>
> Is the library in your assembly jar?
>
> On 6. Apr 2017, at 15:06, nayan sharma  wrote:
>
> Hi All,
> I am getting error while loading CSV file.
>
> val datacsv = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("timeline.csv")
> java.lang.NoSuchMethodError: org.apache.commons.csv.CSVFormat.withQuote(Ljava/lang/Character;)Lorg/apache/commons/csv/CSVFormat;
>
>
> I have added the dependencies in sbt file
>
> // Spark Additional Library - CSV Read as DF
> libraryDependencies += "com.databricks" %% "spark-csv" % "1.5.0"
>
> *and starting the spark-shell with command*
>
> spark-shell --master yarn-client  --jars
> /opt/packages/-data-prepration/target/scala-2.10/-data-prepration-assembly-1.0.jar
> --name nayan
>
>
>
> Thanks for any help!!
>
>
> Thanks,
> Nayan
>
>
>
>
>
>
>


Re: Error while reading the CSV

2017-04-07 Thread nayan sharma
Hi Yash,
I know this will work perfect but here I wanted  to read the csv using the 
assembly jar file.

Thanks,
Nayan

> On 07-Apr-2017, at 10:02 AM, Yash Sharma  wrote:
> 
> Hi Nayan,
> I use the --packages with the spark shell and the spark submit. Could you 
> please try that and let us know:
> Command:
> spark-submit --packages com.databricks:spark-csv_2.11:1.4.0
> 
> On Fri, 7 Apr 2017 at 00:39 nayan sharma  > wrote:
> spark version 1.6.2
> scala version 2.10.5
> 
>> On 06-Apr-2017, at 8:05 PM, Jörn Franke > > wrote:
>> 
>> And which version does your Spark cluster use?
>> 
>> On 6. Apr 2017, at 16:11, nayan sharma > > wrote:
>> 
>>> scalaVersion := “2.10.5"
>>> 
>>> 
>>> 
>>> 
 On 06-Apr-2017, at 7:35 PM, Jörn Franke > wrote:
 
 Maybe your Spark is based on scala 2.11, but you compile it for 2.10 or 
 the other way around?
 
 On 6. Apr 2017, at 15:54, nayan sharma > wrote:
 
> In addition I am using spark version 1.6.2
> Is there any chance the error is coming because the Scala version or 
> dependencies are not matching? I just guessed.
> 
> Thanks,
> Nayan
> 
>  
>> On 06-Apr-2017, at 7:16 PM, nayan sharma > > wrote:
>> 
>> Hi Jorn,
>> Thanks for replying.
>> 
>> jar -tf catalyst-data-prepration-assembly-1.0.jar | grep csv
>> 
>> after doing this I have found a lot of classes under 
>> com/databricks/spark/csv/
>> 
>> do I need to check for any specific class ??
>> 
>> Regards,
>> Nayan
>>> On 06-Apr-2017, at 6:42 PM, Jörn Franke >> > wrote:
>>> 
>>> Is the library in your assembly jar?
>>> 
>>> On 6. Apr 2017, at 15:06, nayan sharma >> > wrote:
>>> 
 Hi All,
 I am getting error while loading CSV file.
 
 val 
 datacsv=sqlContext.read.format("com.databricks.spark.csv").option("header",
  "true").load("timeline.csv")
 java.lang.NoSuchMethodError: 
 org.apache.commons.csv.CSVFormat.withQuote(Ljava/lang/Character;)Lorg/apache/commons/csv/CSVFormat;
 
 
 I have added the dependencies in sbt file 
 // Spark Additional Library - CSV Read as DF
 libraryDependencies += "com.databricks" %% "spark-csv" % “1.5.0"
 and starting the spark-shell with command
 
 spark-shell --master yarn-client  --jars 
 /opt/packages/-data-prepration/target/scala-2.10/-data-prepration-assembly-1.0.jar
  --name nayan 
 
 
 
 Thanks for any help!!
 
 
 Thanks,
 Nayan
>> 
> 
>>> 
> 



reading snappy eventlog files from hdfs using spark

2017-04-07 Thread satishl
Hi, I am planning to process Spark app event logs with another Spark app.
These event logs are saved with snappy compression (extension: .snappy).
When I read the file in a new Spark app, I get a snappy library not found
error. I am confused as to how Spark can write the event log in snappy format
without an error, but reading fails with the above error.

Any help in unblocking myself to read snappy eventlog files from hdfs using
spark?






Re: is there a way to persist the lineages generated by spark?

2017-04-07 Thread kant kodali
Yes, a lineage that is actually replayable is what is needed for the
validation process, so we can address questions like how a system arrived at
a state S at a time T. I guess a good analogy is event sourcing.
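
For what it's worth, the textual plan/lineage can at least be captured and
saved alongside the results; a rough sketch (the RDD/DataFrame names and the
output path are made up):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// For an RDD: the recursive dependency (lineage) chain.
val rddLineage = someRdd.toDebugString

// For a DataFrame/Dataset: the parsed, analyzed, optimized and physical plans.
val dfLineage = someDf.queryExecution.toString

Files.write(Paths.get("/tmp/lineage.txt"),
  (rddLineage + "\n\n" + dfLineage).getBytes(StandardCharsets.UTF_8))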


On Thu, Apr 6, 2017 at 10:30 PM, Jörn Franke  wrote:

> I do think this is the right way; you will have to do testing with test
> data, verifying that the expected output of the calculation is the actual
> output. Even if the logical plan is correct, your calculation might not be.
> E.g. there can be bugs in Spark or in the UI, or (what happens very often)
> the client describes a calculation, but in the end the description is wrong.
>
> > On 4. Apr 2017, at 05:19, kant kodali  wrote:
> >
> > Hi All,
> >
> > I am wondering if there is a way to persist the lineages generated by Spark
> underneath? Some of our clients want us to prove that the result of the
> computation that we are showing on a dashboard is correct, and for that, if
> we can show the lineage of transformations that were executed to get to the
> result then that can be the Q.E.D. moment, but I am not even sure if this is
> even possible with Spark?
> >
> > Thanks,
> > kant
>