Re: rdd join very slow when rdd created from data frame

2016-01-12 Thread Kevin Mellott
Can you please provide the high-level schema of the entities that you are attempting to join? I think that you may be able to use a more efficient technique to join these together; perhaps by registering the Dataframes as temp tables and constructing a Spark SQL query. Also, which version of

Re: yarn-client: SparkSubmitDriverBootstrapper not found in yarn client mode (1.6.0)

2016-01-13 Thread Kevin Mellott
Lin - if you add "--verbose" to your original *spark-submit* command, it will let you know the location in which Spark is running. As Marcelo pointed out, this will likely indicate version 1.3, which may help you confirm if this is your problem. On Wed, Jan 13, 2016 at 12:06 PM, Marcelo Vanzin

Re: sqlContext.cacheTable("tableName") vs dataFrame.cache()

2016-01-15 Thread Kevin Mellott
Hi George, I believe that sqlContext.cacheTable("tableName") is to be used when you want to cache the data that is being used within a Spark SQL query. For example, take a look at the code below. > val myData = sqlContext.load("com.databricks.spark.csv", Map("path" -> > "hdfs://somepath/file",
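A minimal sketch contrasting the two approaches (the table name "people" and column names are hypothetical, and a SQLContext is assumed to exist):

```scala
// Cache the data behind a registered table, for use in Spark SQL queries
sqlContext.cacheTable("people")
val viaSql = sqlContext.sql("SELECT name FROM people WHERE age > 21")

// Cache a DataFrame directly, for programmatic reuse of the same data
val df = sqlContext.table("people").cache()
val adults = df.filter(df("age") > 21)
```

Both end up caching the same underlying data; the difference is whether you reference it by table name in SQL or by DataFrame variable in code.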

Re: Hive is unable to read avro file written by spark avro

2016-01-13 Thread Kevin Mellott
Hi Sivakumar, I have run into this issue in the past, and we were able to fix it by using an explicit schema when saving the DataFrame to the Avro file. This schema was an exact match to the one associated with the metadata on the Hive database table, which allowed the Hive queries to work even

Re: Multi tenancy, REST and MLlib

2016-01-15 Thread Kevin Mellott
It sounds like you may be interested in a solution that implements the Lambda Architecture, such as Oryx2. At a high level, this gives you the ability to request and receive information immediately (serving layer), generating

Re: Spark Distribution of Small Dataset

2016-01-28 Thread Kevin Mellott
Hi Phil, The short answer is that there is a driver machine (which handles the distribution of tasks and data) and a number of worker nodes (which receive data and perform the actual tasks). That being said, certain tasks need to be performed on the driver, because they require all of the data.

Re: unsubscribe email

2016-02-01 Thread Kevin Mellott
Take a look at the first section on http://spark.apache.org/community.html. You basically just need to send an email from the aliased email to user-unsubscr...@spark.apache.org. If you cannot log into that email directly, then I'd recommend using a mail client that allows for the "send-as"

Re: how to convert millisecond time to SQL timeStamp

2016-02-01 Thread Kevin Mellott
I've had pretty good success using Joda-Time for date/time manipulations within Spark applications. You may be able to use the *DateTime* constructor below, if you are starting with milliseconds. DateTime public DateTime(long instant) Constructs an
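A hedged sketch of that constructor in use (the millisecond value is hypothetical, and the joda-time library is assumed to be on the classpath):

```scala
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat

// Construct a DateTime from epoch milliseconds
val millis = 1454313600000L            // hypothetical value
val dt = new DateTime(millis)

// Format it into the "yyyy-MM-dd HH:mm:ss" shape accepted by Timestamp.valueOf
val fmt = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss")
val sqlTimestamp = java.sql.Timestamp.valueOf(dt.toString(fmt))

// Note: if no other manipulation is needed, new java.sql.Timestamp(millis)
// achieves the same result without Joda-Time.
```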

Re: Spark SQL. How to enlarge output rows?

2016-01-27 Thread Kevin Mellott
I believe that *show* should work if you provide it with both the number of rows and the truncate flag. ex: df.show(10, false) http://spark.apache.org/docs/1.5.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.show On Wed, Jan 27, 2016 at 2:39 AM, Akhil Das

Re: How to get progress information of an RDD operation

2016-02-23 Thread Kevin Mellott
Have you considered using the Spark Web UI to view progress on your job? It does a very good job showing the progress of the overall job, as well as allows you to drill into the individual tasks and server activity. On Tue, Feb 23, 2016 at 12:53 PM, Wang, Ningjun (LNG-NPV) <

Re: Network Spark Streaming from multiple remote hosts

2016-02-23 Thread Kevin Mellott
Hi Vinti, That example is (in my opinion) more of a tutorial and not necessarily the way you'd want to set it up for a "real world" application. I'd recommend using something like Apache Kafka, which will allow the various hosts to publish messages to a queue. Your Spark Streaming application is

Re: Spark SQL partitioned tables - check for partition

2016-02-25 Thread Kevin Mellott
> On 25 February 2016 at 15:28, Kevin Mellott <kevin.r.mell...@gmail.com> > wrote: > >> Once you have loaded information into a DataFrame, you can use the >> *mapPartitions or foreachPartition* operations to both identify the partitions and >> operate

Re: Spark SQL partitioned tables - check for partition

2016-02-25 Thread Kevin Mellott
Once you have loaded information into a DataFrame, you can use the *mapPartitions or foreachPartition* operations to both identify the partitions and operate against them. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame On Thu, Feb 25, 2016 at 9:24 AM,
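A short sketch of both operations against an already-loaded DataFrame `df` (the per-partition logic shown is illustrative only):

```scala
// Run a side-effecting action once per partition, e.g. batched writes
df.rdd.foreachPartition { rows =>
  // rows is an Iterator[Row] covering one partition
  println(s"partition contains ${rows.size} rows")
}

// Transform each partition as a unit; here, emit one count per partition
val rowCounts = df.rdd.mapPartitions { rows =>
  Iterator(rows.size)
}.collect()
```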

Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread Kevin Mellott
> 0) 1.0 else -1.0 > } > > override private[mllib] def computeError(prediction: Double, label: > Double): Double = { > val err = label - prediction > math.abs(err) > } > } > > > On 29 February 2016 at 15:49, Kevin Mellott <kevin.r.mell...@gmail.com>

Flattening Data within DataFrames

2016-02-29 Thread Kevin Mellott
Fellow Sparkers, I'm trying to "flatten" my view of data within a DataFrame, and am having difficulties doing so. The DataFrame contains product information, which includes multiple levels of categories (primary, secondary, etc). *Example Data (Raw):* *NameLevel

Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread Kevin Mellott
constructor is undefined. > > I'm using Spark 1.3.0, maybe it is not ready for this version? > > > > On 29 February 2016 at 15:38, Kevin Mellott <kevin.r.mell...@gmail.com> > wrote: > >> I believe that you can instantiate an instance of the AbsoluteError class >>

Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread Kevin Mellott
You can use the constructor that accepts a BoostingStrategy object, which will allow you to set the tree strategy (and other hyperparameters as well). *GradientBoostedTrees
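A sketch of that configuration in Scala (the hyperparameter values are illustrative, and `trainingData` is assumed to be an existing RDD[LabeledPoint]):

```scala
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.loss.AbsoluteError

// Start from the default regression strategy, then swap in the desired loss
val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.loss = AbsoluteError
boostingStrategy.numIterations = 50
boostingStrategy.treeStrategy.maxDepth = 5

// Train using the constructor/train path that accepts a BoostingStrategy
val model = GradientBoostedTrees.train(trainingData, boostingStrategy)
```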

Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread Kevin Mellott
the Absolute Error in setLoss() function? > > > > > On 29 February 2016 at 15:26, Kevin Mellott <kevin.r.mell...@gmail.com> > wrote: > >> You can use the constructor that accepts a BoostingStrategy object, which >> will allow you to set the tree s

Re: Flattening Data within DataFrames

2016-02-29 Thread Kevin Mellott
Thanks Michal - this is exactly what I need. On Mon, Feb 29, 2016 at 11:40 AM, Michał Zieliński < zielinski.mich...@gmail.com> wrote: > Hi Kevin, > > This should help: > > https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-spark.html > > On 29 Fe

Re: Using functional programming rather than SQL

2016-02-22 Thread Kevin Mellott
In your example, the *rs* instance should be a DataFrame object. In other words, the result of *HiveContext.sql* is a DataFrame that you can manipulate using *filter, map, *etc. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.hive.HiveContext On Mon, Feb 22, 2016
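For illustration, a minimal sketch (the table and column names are hypothetical, and a HiveContext is assumed to exist) of treating the query result as a DataFrame:

```scala
// The result of HiveContext.sql is a DataFrame
val rs = hiveContext.sql("SELECT name, amount FROM sales")

// ...so the standard DataFrame operations apply directly
val filtered = rs.filter(rs("amount") > 100).select("name")
filtered.show()
```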

Re: Passing binding variable in query used in Data Source API

2016-01-21 Thread Kevin Mellott
Another alternative that you can consider is to use Sqoop to move your data from PostgreSQL to HDFS, and then just load it into your DataFrame without needing to use JDBC drivers. I've had success using this approach, and depending on your setup you can easily

Re: [Spark] Reading avro file in Spark 1.3.0

2016-01-25 Thread Kevin Mellott
I think that you may be looking at documentation pertaining to the more recent versions of Spark. Try looking at the examples linked below, which applies to the Spark 1.3 version. There aren't many Java examples, but the code should be very similar to the Scala ones (i.e. using "load" instead of

Re: trouble implementing complex transformer in java that can be used with Pipeline. Scala to Java porting problem

2016-01-20 Thread Kevin Mellott
Hi Andy, According to the API documentation for DataFrame , you should have access to *sqlContext* as a property off of the DataFrame instance. In your example, you could then do something like:

Re: RDD recomputation

2016-03-10 Thread Kevin Mellott
I've had very good success troubleshooting this type of thing by using the Spark Web UI, which will depict a breakdown of all tasks. This also includes the RDDs being used, as well as any cached data. Additional information about this tool can be found at

Re: How to convert Parquet file to a text file.

2016-03-15 Thread Kevin Mellott
I'd recommend reading the parquet file into a DataFrame object, and then using spark-csv to write to a CSV file. On Tue, Mar 15, 2016 at 3:34 PM, Shishir Anshuman wrote: > I need to convert the parquet file generated by the
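A sketch of that approach in Spark 1.x syntax, with hypothetical paths, using the com.databricks.spark.csv format from the spark-csv package:

```scala
// Read the Parquet file into a DataFrame
val df = sqlContext.read.parquet("hdfs://somepath/data.parquet")

// Write it back out as CSV via spark-csv
df.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("hdfs://somepath/data_csv")
```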

Re: how to implement ALS with csv file? getting error while calling Rating class

2016-03-07 Thread Kevin Mellott
If you are using DataFrames, then you also can specify the schema when loading as an alternate solution. I've found Spark-CSV to be a very useful library when working with CSV data.

Re: Alternative to groupByKey() + mapValues() for non-commutative, non-associative aggregate?

2016-05-03 Thread Kevin Mellott
If you put this into a dataframe then you may be able to use one hot encoding and treat these as categorical features. I believe that the ml pipeline components use project tungsten so the performance will be very fast. After you process the result on the dataframe you would then need to assemble
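A minimal sketch of the indexing-plus-encoding step with spark.ml, assuming a DataFrame `df` with a hypothetical string column "category":

```scala
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder}

// Map the categorical strings to numeric indices
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

// One-hot encode the indices into a sparse feature vector
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")

val indexed = indexer.fit(df).transform(df)
val encoded = encoder.transform(indexed)
```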

Re: pivot over non numerical data

2017-02-01 Thread Kevin Mellott
This should work for non-numerical data as well - can you please elaborate on the error you are getting and provide a code sample? As a preliminary hint, you can "aggregate" text values using *max*. df.groupBy("someCol") .pivot("anotherCol") .agg(max($"textCol")) Thanks, Kevin On Wed, Feb

Re: Optimal/Expected way to run demo spark-scala scripts?

2016-09-23 Thread Kevin Mellott
You can run Spark code using the command line or by creating a JAR file (via IntelliJ or other IDE); however, you may wish to try a Databricks Community Edition account instead. They offer Spark as a managed service, and you can run Spark commands one at a time via interactive notebooks. There are

Re: Spark Streaming Advice

2016-10-10 Thread Kevin Mellott
On Thu, Oct 6, 2016 at 4:22 PM, Kevin Mellott <kevin.r.mell...@gmail.com> wrote: > I'm attempting to implement a Spark Streaming application that will > consume application log messages from a message broker and store the > information in HDFS. During the data ingestion, we apply a cus

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
") > > I tried these two commands. > write.df(sankar2,"/nfspartition/sankar/test/test.csv","csv",header="true") > > saveDF(sankar2,"sankartest.csv",source="csv",mode="append",schema="true") > > > > On Tue,

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
g those part files we are getting above error. > > Regards > > > On Tue, Sep 20, 2016 at 7:46 PM, Kevin Mellott <kevin.r.mell...@gmail.com> > wrote: > >> Have you checked to see if any files already exist at >> /nfspartition/sankar/banking_l1_v2.csv? If so, you w

Re: unresolved dependency: datastax#spark-cassandra-connector;2.0.0-s_2.11-M3-20-g75719df: not found

2016-09-21 Thread Kevin Mellott
The "unresolved dependency" error is stating that the datastax dependency could not be located in the Maven repository. I believe that this should work if you change that portion of your command to the following. --packages com.datastax.spark:spark-cassandra-connector_2.10:2.0.0-M3 You can

Re: driver OOM - need recommended memory for driver

2016-09-19 Thread Kevin Mellott
Hi Anand, Unfortunately, there is not really a "one size fits all" answer to this question; however, here are some things that you may want to consider when trying different sizes. - What is the size of the data you are processing? - Whenever you invoke an action that requires ALL of the

Similar Items

2016-09-19 Thread Kevin Mellott
Hi all, I'm trying to write a Spark application that will detect similar items (in this case products) based on their descriptions. I've got an ML pipeline that transforms the product data to TF-IDF representation, using the following components. - *RegexTokenizer* - strips out non-word
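A TF-IDF pipeline along these lines might look like the following sketch (only RegexTokenizer is named above; the remaining stages, the token pattern, and all column names are assumptions):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{RegexTokenizer, HashingTF, IDF}

// Strip out non-word characters while tokenizing the description text
val tokenizer = new RegexTokenizer()
  .setInputCol("description").setOutputCol("words")
  .setPattern("\\W")

// Hash the tokens into term-frequency vectors
val tf = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures")

// Re-weight term frequencies by inverse document frequency
val idf = new IDF()
  .setInputCol("rawFeatures").setOutputCol("features")

// products: DataFrame of product data, assumed to exist
val model = new Pipeline().setStages(Array(tokenizer, tf, idf)).fit(products)
val features = model.transform(products)
```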

Re: In Spark-scala, how to fill Vectors.dense in DataFrame from CSV?

2016-09-22 Thread Kevin Mellott
You'll want to use the spark-csv package, which is included in Spark 2.0. The repository documentation has some great usage examples. https://github.com/databricks/spark-csv Thanks, Kevin On Thu, Sep 22, 2016 at 8:40 PM, Dan Bikle wrote: > hello spark-world, > > I am new
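A minimal Spark 2.0 sketch (the path and options are hypothetical) using the built-in CSV reader that absorbed spark-csv:

```scala
// In Spark 2.0, CSV support is built into the DataFrameReader
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")   // infer column types from the data
  .csv("data/input.csv")
```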

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
one also > > On Sep 20, 2016 10:44 PM, "Kevin Mellott" <kevin.r.mell...@gmail.com> > wrote: > >> Instead of *mode="append"*, try *mode="overwrite"* >> >> On Tue, Sep 20, 2016 at 11:30 AM, Sankar Mittapally < >> sankar.m

Re: Similar Items

2016-09-20 Thread Kevin Mellott
looked at. > https://github.com/soundcloud/cosine-lsh-join-spark - not used this but > looks like it should do exactly what you need. > https://github.com/mrsqueeze/*spark*-hash > <https://github.com/mrsqueeze/spark-hash> > > > On Tue, 20 Sep 2016 at 18:06 Kevin Mellott <

Re: Similar Items

2016-09-20 Thread Kevin Mellott
Using the Soundcloud implementation of LSH, I was able to process a 22K product dataset in a mere 65 seconds! Thanks so much for the help! On Tue, Sep 20, 2016 at 1:15 PM, Kevin Mellott <kevin.r.mell...@gmail.com> wrote: > Thanks Nick - those examples will help a ton!! > > On Tu

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
Have you checked to see if any files already exist at /nfspartition/sankar/banking_l1_v2.csv? If so, you will need to delete them before attempting to save your DataFrame to that location. Alternatively, you may be able to specify the "mode" setting of the df.write operation to "overwrite",
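A sketch of the overwrite approach, assuming `df` is the DataFrame being saved and the spark-csv format from this thread:

```scala
import org.apache.spark.sql.SaveMode

// Overwrite any existing output instead of failing on pre-existing files
df.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .mode(SaveMode.Overwrite)
  .save("/nfspartition/sankar/banking_l1_v2.csv")
```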

Re: study materials for operators on Dataframe

2016-09-19 Thread Kevin Mellott
I would recommend signing up for a Databricks Community Edition account. It will give you access to a 6GB cluster, with many different example programs that you can use to get started. https://databricks.com/try-databricks If you are looking for a more formal training method, I just completed

Re: Spark ML Decision Trees Algorithm

2016-09-30 Thread Kevin Mellott
The documentation details the algorithm being used at http://spark.apache.org/docs/latest/mllib-decision-tree.html Thanks, Kevin On Fri, Sep 30, 2016 at 1:14 AM, janardhan shetty wrote: > Hi, > > Any help here is appreciated .. > > On Wed, Sep 28, 2016 at 11:34 AM,

Re: Dataframe Grouping - Sorting - Mapping

2016-09-30 Thread Kevin Mellott
When you perform a .groupBy, you need to perform an aggregate immediately afterwards. For example: val df1 = df.groupBy("colA").agg(sum(df("colB"))) df1.show() More information and examples can be found in the documentation below.

Re: Spark Streaming Advice

2016-10-10 Thread Kevin Mellott
email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > On 10 October 2016 at 15:25, Kevin Mellott <kevin.r.mell...@gmail.com> > wrote: > >> Whilst w

Re: Nearest neighbour search

2016-11-14 Thread Kevin Mellott
You may be able to benefit from Soundcloud's open source implementation, either as a solution or as a reference implementation. https://github.com/soundcloud/cosine-lsh-join-spark Thanks, Kevin On Sun, Nov 13, 2016 at 2:07 PM, Meeraj Kunnumpurath < mee...@servicesymphony.com> wrote: > That was

Re: Aggregated column name

2017-03-23 Thread Kevin Mellott
I'm not sure of the answer to your question; however, when performing aggregates I find it useful to specify an *alias* for each column. That will give you explicit control over the name of the resulting column. In your example, that would look something like:
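A sketch with hypothetical column names, aliasing each aggregate so the resulting column names are explicit:

```scala
import org.apache.spark.sql.functions._

// Each alias controls the name of the corresponding output column
val summary = df.groupBy("department")
  .agg(
    sum("amount").alias("total_amount"),
    avg("amount").alias("avg_amount")
  )
```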

Re: How to sleep Spark job

2019-01-22 Thread Kevin Mellott
I’d recommend using a scheduler of some kind to trigger your job each hour, and have the Spark job exit when it completes. Spark is not meant to run in any type of “sleep mode”, unless you want to run a structured streaming job and create a separate process to pull data from Cassandra and publish

Re: Exception when reading multiline JSON file

2019-09-12 Thread Kevin Mellott
Hi Kumaresh, This is most likely an issue with the size of your Spark cluster not being large enough to accomplish the desired task. Hints for this type of situation are when the stack trace mentions things like a size limitation was exceeded and you lost a node. However, this is also a great