Dark mode logo

2024-03-06 Thread Mike Drob
Hi Spark Community, I see that y'all have a logo uploaded to https://www.apache.org/logos/#spark but it has black text. Is there an official, alternate logo with lighter text that would look good on a dark background? Thanks, Mike

Re: Parallelising JDBC reads in spark

2020-05-24 Thread Mike Artz
Does anything different happen when you set the isolationLevel to do dirty reads, i.e. "READ_UNCOMMITTED"? On Sun, May 24, 2020 at 7:50 PM Manjunath Shetty H wrote: > Hi, > > We are writing an ETL pipeline using Spark that fetches the data from SQL > server in batch mode (every 15 mins). Problem
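A minimal sketch of what exercising that suggestion might look like, assuming a SQL Server source and a hypothetical numeric id column to split the read on; connection details are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("jdbc-batch-read").getOrCreate()

    // Dirty reads via isolationLevel, plus a partitioned read so Spark issues
    // numPartitions parallel queries split on the partition column.
    val jdbcDf = spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=sales") // placeholder
      .option("dbtable", "dbo.events")                                  // placeholder
      .option("user", "etl_user")                                       // placeholder
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .option("isolationLevel", "READ_UNCOMMITTED")
      .option("partitionColumn", "id")      // hypothetical numeric column
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "16")
      .load()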

Re: unsubscribe

2019-11-26 Thread Mike Dillion
Nandan, Please send unsubscribe requests to user-unsubscr...@spark.apache.org On Tue, Nov 26, 2019 at 6:02 AM @Nandan@ wrote: > unsubscribe >

spark.sql.hive.exec.dynamic.partition description

2019-04-29 Thread Mike Chan
parallelism during read operations. Hope this makes sense. Thank you Best Regards, Mike

Fwd: autoBroadcastJoinThreshold not working as expected

2019-04-24 Thread Mike Chan
and failed with the same error. I even changed it to 1MB and still got the same result. I'd appreciate it if you can share any input. Thank you very much. Best Regards, Mike == Physical Plan == *(10) Project [product_key#445, store_key#446, fiscal_year#447, fiscal_month#448, fiscal_week_of_year#449, fiscal_year_week
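For reference, a minimal sketch of the two usual ways to exercise the threshold (setting the session conf, or forcing the broadcast with a hint); the DataFrame names are placeholders:

    import org.apache.spark.sql.functions.broadcast

    // The threshold is in bytes; -1 disables automatic broadcast joins entirely.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760") // 10 MB

    // Or skip the size estimate and force a broadcast of the smaller side.
    val joined = factDf.join(broadcast(dimDf), Seq("product_key", "store_key"))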

Fwd: autoBroadcastJoinThreshold not working as expected

2019-04-23 Thread Mike Chan
and failed with the same error. I even changed it to 1MB and still got the same result. I'd appreciate it if you can share any input. Thank you very much. Best Regards, Mike == Physical Plan == *(10) Project [product_key#445, store_key#446, fiscal_year#447, fiscal_month#448, fiscal_week_of_year#449, fiscal_year_week

autoBroadcastJoinThreshold not working as expected

2019-04-18 Thread Mike Chan
and failed with the same error. I even changed it to 1MB and still got the same result. I'd appreciate it if you can share any input. Thank you very much. Best Regards, Mike == Physical Plan == *(10) Project [product_key#445, store_key#446, fiscal_year#447, fiscal_month#448, fiscal_week_of_year#449, fiscal_year_week

Question about Spark, Inner Join and Delegation to a Parquet Table

2018-07-02 Thread Mike Buck
I have a question about Spark and how it delegates filters to a Parquet-based table. I have two tables in Hive in Parquet format. Table1 has four columns of type double and table2 has two columns of type double. I am doing an INNER JOIN of the following: SELECT table1.name FROM table1

Spark SQL within a DStream map function

2017-06-16 Thread Mike Hugo
in my map function? Or do you have any recommendations as to how I could set up a streaming job in a different way that would allow me to accept metadata on the stream of records coming in and pull each file down from s3 for processing? Thanks in advance for your help! Mike

Parquet Read Speed: Spark SQL vs Parquet MR

2017-06-03 Thread Mike Wheeler
d a) is faster than c) because a) is limited to a SQL query, so Spark can do a lot of things to optimize (such as not fully deserializing the objects). But I don't understand why b) is much slower than c), because I assume both require full deserialization. Is there anything I can try to improve b)? Thanks, Mike

Re: Convert camelCase to snake_case when saving Dataframe/Dataset to parquet?

2017-05-22 Thread Mike Wheeler
Cool. Thanks a lot in advance. On Mon, May 22, 2017 at 2:12 PM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote: > Mike, > > I have code to do that. I'll share it tomorrow. > > Get Outlook for Android <https://aka.ms/ghei36> > > > > > On Mon, Ma
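A sketch of how such a helper could look (not necessarily the code referred to above): rename every column with a small regex before writing:

    // Convert camelCase column names to snake_case before writing to parquet.
    def camelToSnake(name: String): String =
      name.replaceAll("([a-z0-9])([A-Z])", "$1_$2").toLowerCase

    val renamed = df.toDF(df.columns.map(camelToSnake): _*) // df is the source DataFrame
    renamed.write.parquet("/tmp/out_snake_case")            // placeholder path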

Convert camelCase to snake_case when saving Dataframe/Dataset to parquet?

2017-05-22 Thread Mike Wheeler
ype)? Thanks, Mike

Best Practice for Enum in Spark SQL

2017-05-11 Thread Mike Wheeler
=Car, 2=SUV, 3=Wagon)? 2) If I choose String, any penalty in hard drive space or memory? Thank you! Mike
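A sketch of option 1, encoding the enum as an integer code with when/otherwise; the column name is hypothetical:

    import org.apache.spark.sql.functions.{col, when}

    // 1 = Car, 2 = SUV, 3 = Wagon, as in the question; unmatched values become null.
    val coded = df.withColumn("vehicle_type_code",
      when(col("vehicle_type") === "Car", 1)
        .when(col("vehicle_type") === "SUV", 2)
        .when(col("vehicle_type") === "Wagon", 3))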

Re: Schema Evolution for nested Dataset[T]

2017-05-02 Thread Mike Wheeler
    +---+--------------------+------------+
    | id|            students|students.age|
    +---+--------------------+------------+
    | 20|    [[c,20], [d,10]]|    [20, 10]|
    | 10|[[a,null], [b,null]]|[null, null]|
    +---+--------------------+------------+
It creates a new column "students.age" instead of imputing the val

Schema Evolution for nested Dataset[T]

2017-04-30 Thread Mike Wheeler
m the old to the new. But it is kind of tedious. Any automatic methods? Thanks, Mike - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Combining reading from Kafka and HDFS w/ Spark Streaming

2017-03-01 Thread Mike Thomsen
) streamingContext.start streamingContext.awaitTermination It's not actually counting any of the files in the paths, and I know the paths are valid. Can someone tell me if this is possible and if so, give me a pointer on how to fix this? Thanks, Mike

Combining reading from Kafka and HDFS w/ Spark Streaming

2017-03-01 Thread Mike Thomsen
in the paths, and I know the paths are valid. Can someone tell me if this is possible and if so, give me a pointer on how to fix this? Thanks, Mike

RE: Best way to process lookup ETL with Dataframes

2017-01-04 Thread Sesterhenn, Mike
ght? Thanks, -Mike From: Nicholas Hakobian [mailto:nicholas.hakob...@rallyhealth.com] Sent: Friday, December 30, 2016 5:50 PM To: Sesterhenn, Mike Cc: ayan guha; user@spark.apache.org Subject: Re: Best way to process lookup ETL with Dataframes Yep, sequential joins is what I have done in the p

Re: Best way to process lookup ETL with Dataframes

2016-12-30 Thread Sesterhenn, Mike
data will result. Any other thoughts? From: Nicholas Hakobian <nicholas.hakob...@rallyhealth.com> Sent: Friday, December 30, 2016 2:12:40 PM To: Sesterhenn, Mike Cc: ayan guha; user@spark.apache.org Subject: Re: Best way to process lookup ETL with Data

Re: Best way to process lookup ETL with Dataframes

2016-12-30 Thread Sesterhenn, Mike
need is to join after the first join fails. From: ayan guha <guha.a...@gmail.com> Sent: Thursday, December 29, 2016 11:06 PM To: Sesterhenn, Mike Cc: user@spark.apache.org Subject: Re: Best way to process lookup ETL with Dataframes How about this -

Best way to process lookup ETL with Dataframes

2016-12-29 Thread Sesterhenn, Mike
Hi all, I'm writing an ETL process with Spark 1.5, and I was wondering the best way to do something. A lot of the fields I am processing require an algorithm similar to this: Join input dataframe to a lookup table. if (that lookup fails (the joined fields are null)) { Lookup into some
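A minimal sketch of the pattern described above, with hypothetical table and column names: each lookup is a left outer join, and coalesce keeps whichever lookup resolved first:

    import org.apache.spark.sql.functions.coalesce

    // Primary lookup, then fallback lookup, then take the first non-null result.
    val withPrimary  = input.join(lookupA, input("key") === lookupA("a_key"), "left_outer")
    val withFallback = withPrimary.join(lookupB, withPrimary("alt_key") === lookupB("b_key"), "left_outer")

    val resolved = withFallback.withColumn(
      "resolved_value",
      coalesce(lookupA("a_value"), lookupB("b_value")))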

Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Mike Metzger
issues went away. I do not fully understand Scala yet, but you may be able to set one of your dataframes to null to accomplish the same. Mike On Mon, Oct 17, 2016 at 8:38 PM, Mungeol Heo <mungeol@gmail.com> wrote: > First of all, thank you for your comments. > Actually, what I

Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Mike Metzger
Hi Mich - Can you run a filter command on df1 prior to your map for any rows where p(3).toString != '-' then run your map command? Thanks Mike On Tue, Sep 27, 2016 at 5:06 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > Thanks guys > > Actually these are t
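A sketch of that suggestion, assuming df1 is an RDD of rows indexed as in the thread and the fourth field should be numeric:

    // Drop rows whose fourth field is the placeholder "-" before converting it.
    val cleaned = df1.filter(p => p(3) != null && p(3).toString != "-")
    val parsed  = cleaned.map(p => p(3).toString.toDouble) // hypothetical conversion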

Re: Total Shuffle Read and Write Size of Spark workload

2016-09-19 Thread Mike Metzger
redirect over the SSH session to the remote server. Note that any link that references a non-accessible IP address can't be reached (though you can also setup putty / SSH as a proxy to get around that if needed). Thanks Mike On Mon, Sep 19, 2016 at 4:43 AM, Cristina Rozee <rozee.crist...@gmai

Re: year out of range

2016-09-08 Thread Mike Metzger
p (the thing that usually bites me with CSV conversions). Assuming these all match to what you want, I'd try mapping the unparsed date column out to separate fields and try to see if a year field isn't matching the expected values. Thanks Mike On Thu, Sep 8, 2016 at 8:15 AM, Daniel Lopes <d

Re: Best ID Generator for ID field in parquet ?

2016-09-04 Thread Mike Metzger
/ guid which are generally unique across all entries assuming enough randomness. Think of the monotonically increasing id as an auto-incrementing column (with potentially massive gaps in ids) from a relational database. Thanks Mike On Sun, Sep 4, 2016 at 6:41 PM, Kevin Tran <kevin...@gmail.
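A minimal sketch of the function being discussed; the ids are unique and increasing per row but, like the auto-increment analogy above, can have very large gaps:

    import org.apache.spark.sql.functions.monotonically_increasing_id

    // 64-bit id: upper bits encode the partition, lower bits a per-partition counter.
    val withRowId = df.withColumn("row_id", monotonically_increasing_id())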

Re: Spark 2.0 - Insert/Update to a DataFrame

2016-08-26 Thread Mike Metzger
ibly with a UDF, to keep things a little more clear. Thanks Mike On Fri, Aug 26, 2016 at 4:45 PM, Subhajit Purkayastha <spurk...@p3si.net> wrote: > So the data in the fcst dataframe is like this > > > > Product, fcst_qty > > A 100 > > B

Re: Please assist: Building Docker image containing spark 2.0

2016-08-26 Thread Mike Metzger
in futility. Thanks Mike On Fri, Aug 26, 2016 at 5:14 PM, Michael Gummelt <mgumm...@mesosphere.io> wrote: > Run with "-X -e" like the error message says. See what comes out. > > On Fri, Aug 26, 2016 at 2:23 PM, Tal Grynbaum <tal.grynb...@gmail.com> > wrote:

Re: Spark 2.0 - Insert/Update to a DataFrame

2016-08-26 Thread Mike Metzger
on the layout and what language you're using. Thanks Mike On Fri, Aug 26, 2016 at 3:29 PM, Subhajit Purkayastha <spurk...@p3si.net> wrote: > Mike, > > > > The grains of the dataFrame are different. > > > > I need to reduce the forecast qty (which is in the FCST DF)

Re: Spark 2.0 - Insert/Update to a DataFrame

2016-08-26 Thread Mike Metzger
Without seeing the makeup of the Dataframes nor what your logic is for updating them, I'd suggest doing a join of the Forecast DF with the appropriate columns from the SalesOrder DF. Mike On Fri, Aug 26, 2016 at 11:53 AM, Subhajit Purkayastha <spurk...@p3si.net> wrote: > I am using
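A sketch of that suggestion with hypothetical column names: join the forecast to its sales orders and net the ordered quantity out of the forecast quantity, treating a missing order as zero:

    import org.apache.spark.sql.functions.{coalesce, lit}

    val netted = fcstDf
      .join(salesOrderDf, Seq("product"), "left_outer")
      .withColumn("net_fcst_qty",
        fcstDf("fcst_qty") - coalesce(salesOrderDf("order_qty"), lit(0)))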

Re: UDF on lpad

2016-08-25 Thread Mike Metzger
Is this what you're after? def padString(id: Int, chars: String, length: Int): String = chars * length + id.toString padString(123, "0", 10) res4: String = 00123 Mike On Thu, Aug 25, 2016 at 12:39 PM, Mich Talebzadeh <mich.talebza...@gmail.com > wrote: > Thank
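For the padding itself, Spark's built-in lpad pads to a fixed total width, so a sketch without a custom UDF could be:

    import org.apache.spark.sql.functions.{col, lpad}

    // Pad the id out to 10 characters with leading zeros, e.g. 123 -> "0000000123".
    val padded = df.withColumn("padded_id", lpad(col("id").cast("string"), 10, "0"))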

Re: UDF on lpad

2016-08-25 Thread Mike Metzger
"$c%010.2f" // Result is 123.87 You can also do inline operations on the values before formatting. I've used this specifically to pad for hex digits from strings. val d = "100" val hexstring = f"0x${d.toInt}%08X" // hexstring is 0x0064 Thanks Mike On Thu,

Re: Sum array values by row in new column

2016-08-15 Thread Mike Metzger
Assuming you know the number of elements in the list, this should work: df.withColumn('total', df["_1"].getItem(0) + df["_1"].getItem(1) + df["_1"].getItem(2)) Mike On Mon, Aug 15, 2016 at 12:02 PM, Javier Rey <jre...@gmail.com> wrote: > Hi everyone, &

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mike Metzger
and referencing it that way so there's less cluster communication involved. Honestly I doubt there's a lot of variance with this small of a value but it's a good habit to get into. Thanks Mike On Fri, Aug 5, 2016 at 11:33 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > Th

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mike Metzger
Should be pretty much the same code for Scala - import java.util.UUID UUID.randomUUID If you need it as a UDF, just wrap it accordingly. Mike On Fri, Aug 5, 2016 at 11:38 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > On the same token can one generate a UUID like belo
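Wrapping it as a UDF might look like this sketch:

    import java.util.UUID
    import org.apache.spark.sql.functions.udf

    // Non-deterministic: every evaluation produces a fresh UUID.
    val uuidUdf  = udf(() => UUID.randomUUID().toString)
    val withUuid = df.withColumn("uuid", uuidUdf())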

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mike Metzger
Not that I've seen, at least not in any worker independent way. To guarantee consecutive values you'd have to create a udf or some such that provided a new row id. This probably isn't an issue on small data sets but would cause a lot of added communication on larger clusters / datasets. Mike

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mike Metzger
for that. I've been toying with an implementation that allows you to specify the split for better control along with a start value. Thanks Mike > On Aug 5, 2016, at 11:07 AM, Tony Lane <tonylane@gmail.com> wrote: > > Mike. > > I have figured how to do this . Than

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mike Metzger
ns library. Mike > On Aug 5, 2016, at 9:11 AM, Tony Lane <tonylane@gmail.com> wrote: > > Ayan - basically i have a dataset with structure, where bid are unique string > values > > bid: String > val : integer > > I need unique int values for these string bid

Re: Add column sum as new column in PySpark dataframe

2016-08-04 Thread Mike Metzger
This is a little ugly, but it may do what you're after - df.withColumn('total', expr("+".join([col for col in df.columns]))) I believe this will handle null values ok, but will likely error if there are any string columns present. Mike On Thu, Aug 4, 2016 at 8:41 AM, Javie

Python memory included YARN-monitored memory?

2016-05-27 Thread Mike Sukmanowsky
things on one of the YARN nodes would seem to indicate this isn't the case since the spawned daemon gets a separate process ID and process group, but I wanted to check to confirm as it could make a big difference to pyspark users hoping to tune things. Thanks, Mike

RE: SparkR query

2016-05-17 Thread Mike Lewis
Rui [mailto:sunrise_...@163.com] Sent: 17 May 2016 11:32 To: Mike Lewis Cc: user@spark.apache.org Subject: Re: SparkR query Lewis, 1. Could you check the values of “SPARK_HOME” environment on all of your worker nodes? 2. How did you start your SparkR shell? On May 17, 2016, at 18:07, Mike Lewis

SparkR query

2016-05-17 Thread Mike Lewis
tory Is this a configuration setting that I’m missing, the worker nodes (linux) shouldn’t be looking in the spark home of the driver (windows) ? If so, I’d appreciate someone letting me know what I need to ch

Re: executor delay in Spark

2016-04-27 Thread Mike Hynes
analysis in your language of choice; hopefully it can aid in finer-grained debugging (the headers of the fields it prints are listed in one of the functions). Mike On 4/25/16, Raghava Mutharaju <m.vijayaragh...@gmail.com> wrote: > Mike, > > We ran our program with 16, 32 an

Re: executor delay in Spark

2016-04-24 Thread Mike Hynes
, but if not and your executors all receive at least *some* partitions, then I still wouldn't rule out effects of scheduling delay. It's a simple test, but it could give some insight. Mike If only one has *all* partitions, this could still be a scheduling issue; could you email me the log file? (If it's 10+ MB

Re: executor delay in Spark

2016-04-22 Thread Mike Hynes
by unusual initial task scheduling. I don't know of ways to avoid this other than creating a dummy task to synchronize the executors, but hopefully someone from there can suggest other possibilities. Mike On Apr 23, 2016 5:53 AM, "Raghava Mutharaju" <m.vijayaragh...@gmail.com>
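A sketch of the dummy-task idea: run a trivial job first so every executor has registered and received work before the real stages are scheduled:

    // Touch every core once; by the time this finishes, all executors
    // should have registered with the driver.
    val warmupTasks = sc.defaultParallelism * 4
    sc.parallelize(1 to warmupTasks, warmupTasks).foreach(_ => ())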

Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Mike Hynes
durations less than the initial executor delay. I recommend you look at your logs to verify if this is happening to you. Mike On 4/18/16, Anuj Kumar <anujs...@gmail.com> wrote: > Good point Mike +1 > > On Mon, Apr 18, 2016 at 9:47 AM, Mike Hynes <91m...@gmail.com> wrote: >

Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Mike Hynes
3. Make the tasks longer, i.e. with some silly computational work. Mike On 4/17/16, Raghava Mutharaju <m.vijayaragh...@gmail.com> wrote: > Yes its the same data. > > 1) The number of partitions are the same (8, which is an argument to the > HashPartitioner). In the first case,

Re: RDD Partitions not distributed evenly to executors

2016-04-06 Thread Mike Hynes
alf the number of partitions with the shuffle flag set to true. Would that be reasonable? Thank you very much for your time, and I very much hope that someone from the dev community who is familiar with the scheduler may be able to clarify the above observations and questions. Thanks, Mike P.S. Ko

Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Mike Hynes
f anyone else has any other ideas or experience, please let me know. Mike On 4/4/16, Koert Kuipers <ko...@tresata.com> wrote: > we ran into similar issues and it seems related to the new memory > management. can you try: > spark.memory.useLegacyMode = true > > On Mo

RDD Partitions not distributed evenly to executors

2016-04-04 Thread Mike Hynes
make a previously (<=1.5) correct configuration go haywire? Have new configuration settings been added of which I'm unaware that could lead to this problem? Please let me know if others in the community have observed this, and thank yo

Re: Spark Metrics Framework?

2016-04-01 Thread Mike Sukmanowsky
Thanks Silvio, JIRA submitted https://issues.apache.org/jira/browse/SPARK-14332. On Fri, 25 Mar 2016 at 12:46 Silvio Fiorito <silvio.fior...@granturing.com> wrote: > Hi Mike, > > Sorry got swamped with work and didn’t get a chance to reply. > > I misunderstood what you

Re: Spark Metrics Framework?

2016-03-25 Thread Mike Sukmanowsky
Pinging again - any thoughts? On Wed, 23 Mar 2016 at 09:17 Mike Sukmanowsky <mike.sukmanow...@gmail.com> wrote: > Thanks Ted and Silvio. I think I'll need a bit more hand holding here, > sorry. The way we use ES Hadoop is in pyspark via > org.elasticsearch.hadoop.mr

Re: Spark Metrics Framework?

2016-03-23 Thread Mike Sukmanowsky
o.fior...@granturing.com> wrote: > Hi Mike, > > It’s been a while since I worked on a custom Source but I think all you > need to do is make your Source in the org.apache.spark package. > > Thanks, > Silvio > > From: Mike Sukmanowsky <mike.sukmanow...@gmail.com&
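A sketch of what such a Source could look like; it assumes the Source trait is private[spark], which is why the class has to live under the org.apache.spark package, and the names are illustrative only:

    package org.apache.spark.metrics.source

    import com.codahale.metrics.{Counter, MetricRegistry}

    // Minimal custom source exposing a single counter.
    class EsHadoopSource extends Source {
      override val sourceName: String = "esHadoop"
      override val metricRegistry: MetricRegistry = new MetricRegistry

      val recordsWritten: Counter = metricRegistry.counter(MetricRegistry.name("recordsWritten"))
    }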

Re: Spark Metrics Framework?

2016-03-22 Thread Mike Sukmanowsky
se the metric sources and sinks described here: > http://spark.apache.org/docs/latest/monitoring.html#metrics > > If you want to push the metrics to another system you can define a custom > sink. You can also extend the metrics by defining a custom source. > > From: Mike Sukmanow

Spark Metrics Framework?

2016-03-21 Thread Mike Sukmanowsky
cs framework similar to Hadoop's Counter system to Spark or is there an alternative means for us to grab metrics exposed when using Hadoop APIs to load/save RDDs? Thanks, Mike

Spark Job Server with Yarn and Kerberos

2016-01-04 Thread Mike Wright
Has anyone used Spark Job Server on a "kerberized" cluster in YARN-Client mode? When Job Server contacts the YARN resource manager, we see a "Cannot impersonate root" error and am not sure what we have misconfigured. Thanks. ___ *Mike Wright* Principal

Re: Questions on Kerberos usage with YARN and JDBC

2015-12-13 Thread Mike Wright
Kerberos seems to be working otherwise ... for example, we're using it successfully to control access to HDFS and it's linked to AD ... we're using Ranger if that helps. I'm not a systems admin guy so this is really not my area of expertise. ___ *Mike Wright* Principal Architect

Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-11 Thread Mike Wright
Thanks for the insight! ___ *Mike Wright* Principal Architect, Software Engineering S&P Capital IQ and SNL 434-951-7816 *p* 434-244-4466 *f* 540-470-0119 *m* mwri...@snl.com On Fri, Dec 11, 2015 at 2:38 PM, Michael Armbrust <mich...@databricks.com> wrote: > The way t

Questions on Kerberos usage with YARN and JDBC

2015-12-11 Thread Mike Wright
As part of our implementation, we are utilizing a full "Kerberized" cluster built on the Hortonworks suite. We're using Job Server as the front end to initiate short-run jobs directly from our client-facing product suite. 1) We believe we have configured the job server to start with the

Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-11 Thread Mike Wright
Somewhat related - What's the correct implementation when you have a single cluster to support multiple jobs that are unrelated and NOT sharing data? I was directed to figure out, via job server, to support "multiple contexts" and explained that multiple contexts per JVM is not really supported.

Re: New to Spark - Paritioning Question

2015-09-08 Thread Mike Wright
at the end and writing them all at once. I am using a groupBy against the filtered RDD to get the grouping I want, but apparently this may not be the most efficient way, and it seems that everything is always in a single partition under this scenario. ___ *Mike Wright* Principal Architect

Re: How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-26 Thread Mike Trienis
<id>test</id> <goals> <goal>test</goal> </goals> </execution> </executions> </plugin> </plugins> On Tue, Aug 25, 2015 at 2:10 PM, Mike Trienis mike.trie...@orcsol.com wrote: Hello

How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-25 Thread Mike Trienis
org.apache.spark.sql.hive.test.TestHiveContext However, it suffers from the same memory issue. Has anyone else suffered from the same problem? Note that I am running these unit tests on my mac. Cheers, Mike.

PySpark concurrent jobs using single SparkContext

2015-08-20 Thread Mike Sukmanowsky
a single PySpark context? -- Mike Sukmanowsky Aspiring Digital Carpenter *e*: mike.sukmanow...@gmail.com LinkedIn http://www.linkedin.com/profile/view?id=10897143 | github https://github.com/msukmanowsky

Spark SQL window functions (RowsBetween)

2015-08-20 Thread Mike Trienis
with:

    col2   | col3
    -------+-----
    item_1 | 2
    item_2 | 2

Thanks, Mike.

Optimal way to implement a small lookup table for identifiers in an RDD

2015-08-10 Thread Mike Trienis
(entity: Entity): EntityExtended = { val id = entity.identifier // lookup identifier in broadcast variable? } } Thanks, Mike.
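A minimal sketch of the broadcast-variable approach, assuming the lookup fits in driver memory as a Map; Entity, EntityExtended and entities stand in for the thread's own types:

    // Broadcast the small lookup table once, then reference it inside the closure.
    val lookupMap: Map[String, String] = Map("id-1" -> "extra-1") // hypothetical contents
    val lookupBc  = sc.broadcast(lookupMap)

    val extended = entities.map { entity =>
      val extra = lookupBc.value.getOrElse(entity.identifier, "unknown")
      EntityExtended(entity.identifier, extra) // hypothetical constructor
    }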

control the number of reducers for groupby in data frame

2015-08-04 Thread Fang, Mike
Hi, Does anyone know how I could control the number of reducers when we do an operation such as groupBy for a data frame? I could set spark.sql.shuffle.partitions in SQL but am not sure how to do it with the df.groupBy(XX) API. Thanks, Mike
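The same setting applies to the DataFrame API, since groupBy triggers a shuffle; a Spark 1.x-style sketch:

    // groupBy shuffles into spark.sql.shuffle.partitions partitions (default 200).
    sqlContext.setConf("spark.sql.shuffle.partitions", "50")
    val counts = df.groupBy("some_key").count() // hypothetical key column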

streamingContext.stop(true,true) doesn't end the job

2015-07-29 Thread mike
doesn't finish. All I see in the log is this: Can someone point me to what I'm doing wrong? Thanks, Mike -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/streamingContext-stop-true-true-doesn-t-end-the-job-tp24064.html Sent from the Apache Spark User

Re: Data frames select and where clause dependency

2015-07-20 Thread Mike Trienis
Definitely, thanks Mohammed. On Mon, Jul 20, 2015 at 5:47 PM, Mohammed Guller moham...@glassbeam.com wrote: Thanks, Harish. Mike – this would be a cleaner version for your use case: df.filter(df(filter_field) === value).select(field1).show() Mohammed *From:* Harish Butani

[General Question] [Hadoop + Spark at scale] Spark Rack Awareness ?

2015-07-18 Thread Mike Frampton
I wanted to ask a general question about Hadoop/YARN and Apache Spark integration. I know that Hadoop on a physical cluster has rack awareness, i.e. it attempts to minimise network traffic by saving replicated blocks within a rack. I wondered whether, when Spark is configured to use

Data frames select and where clause dependency

2015-07-17 Thread Mike Trienis
= value); - df.select(field1).filter(df(filter_field) === value).show() As a work-around, it seems that I can do the following - df.select(field1, filter_field).filter(df(filter_field) === value).drop(filter_field).show() Thanks, Mike.

Splitting dataframe using Spark 1.4 for nested json input

2015-07-04 Thread Mike Tracy
transformations. Regards Mike

Error with splitting contents of a dataframe column using Spark 1.4 for nested complex json file

2015-07-01 Thread Mike Tracy
: value split is not a member of Nothing df1.explode("mv", "mvnew")(mv => mv.split(",")) Am I doing something wrong? I need to extract data under mi.mv into separate columns so I can apply some transformations. Regards Mike

[Spark 1.3.1] Spark HiveQL - CDH 5.3 Hive 0.13 UDF's

2015-06-26 Thread Mike Frampton
aware that I can play with Scala and get around this issue, and I have, but I wondered whether others have come across this and solved it? cheers Mike F

Aggregating metrics using Cassandra and Spark streaming

2015-06-24 Thread Mike Trienis
. Has anyone else run into this problem before, and how did you solve it? Thanks, Mike. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: understanding on the waiting batches and scheduling delay in Streaming UI

2015-06-22 Thread Fang, Mike
Hi Das, Thanks for your reply. Somehow I missed it.. I am using Spark 1.3. The data source is from kafka. Yeah, not sure why the delay is 0. I'll run against 1.4 and give a screenshot. Thanks, Mike From: Akhil Das ak...@sigmoidanalytics.commailto:ak...@sigmoidanalytics.com Date: Thursday, June

[Spark 1.3.1 SQL] Using Hive

2015-06-21 Thread Mike Frampton
Hi Is it true that if I want to use Spark SQL (for Spark 1.3.1) against Apache Hive I need to build a source version of Spark? I'm using CDH 5.3 on CentOS Linux 6.5, which uses Hive 0.13.0 (I think). cheers Mike F

understanding on the waiting batches and scheduling delay in Streaming UI

2015-06-17 Thread Mike Fang
? If this is the case, the scheduling delay should be high rather than 0. Am I missing anything? Thanks, Mike

questions on the waiting batches and scheduling delay in Streaming UI

2015-06-16 Thread Fang, Mike
? If this is the case, the scheduling delay should be high rather than 0. Am I missing anything? Thanks, Mike

spark stream twitter question ..

2015-06-13 Thread Mike Frampton
() dfHashTags.registerTempTable(tweets) } // extra stuff here ssc.start() ssc.awaitTermination() } // end main } // end twitter1 cheers Mike F

Re: Managing spark processes via supervisord

2015-06-05 Thread Mike Trienis
/30672648/how-to-autostart-an-apache-spark-cluster-using-supervisord/30676844#30676844 Cheers Mike On Wed, Jun 3, 2015 at 12:29 PM, Igor Berman igor.ber...@gmail.com wrote: assuming you are talking about standalone cluster imho, with workers you won't get any problems and it's straightforward

Managing spark processes via supervisord

2015-06-03 Thread Mike Trienis
the cluster. I am considering using supervisord to control all the processes (worker, master, etc.) in order to have the cluster up and running after boot-up, although I'd like to understand if it will cause more issues than it solves. Thanks, Mike.

Re: Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

2015-05-23 Thread Mike Trienis
core, an executor is simply a JVM instance and as such it can be granted any number of cores and RAM. So check how many cores you have per executor. Sent from Samsung Mobile Original message From: Mike Trienis Date: 2015/05/22 21:51 (GMT+00:00) To: user@spark.apache.org

Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

2015-05-22 Thread Mike Trienis
) result.foreachPartition { i => i.foreach(record => connection.insert(record)) } } def doSomething(rdd: RDD[Data]): RDD[MyObject] = { rdd.flatMap(MyObject) } Any ideas as to how to improve the throughput? Thanks, Mike.

Re: Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

2015-05-22 Thread Mike Trienis
I guess each receiver occupies an executor. So there was only one executor available for processing the job. On Fri, May 22, 2015 at 1:24 PM, Mike Trienis mike.trie...@orcsol.com wrote: Hi All, I have a cluster of four nodes (three workers and one master, with one core each) which consumes data

Spark sql and csv data processing question

2015-05-15 Thread Mike Frampton
Hi I'm getting the following error when trying to process a CSV-based data file. Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 10.0 failed 4 times, most recent failure: Lost task 1.3 in stage 10.0 (TID 262,

Re: Spark + Kinesis + Stream Name + Cache?

2015-05-08 Thread Mike Trienis
Hey Chris! I was happy to see the documentation outlining that issue :-) However, I must have got into a pretty terrible state because I had to delete and recreate the kinesis streams as well as the DynamoDB tables. Thanks for the reply, everything is sorted. Mike On Fri, May 8, 2015 at 7

Re: Spark + Kinesis + Stream Name + Cache?

2015-05-08 Thread Mike Trienis
. If you see errors, you may need to manually delete the DynamoDB table.* On Fri, May 8, 2015 at 2:06 PM, Mike Trienis mike.trie...@orcsol.com wrote: Hi All, I am submitting the assembled fat jar file by the command: bin/spark-submit --jars /spark-streaming-kinesis-asl_2.10-1.3.0.jar

Spark + Kinesis + Stream Name + Cache?

2015-05-08 Thread Mike Trienis
to be outputting the data. Has anyone else encountered a similar issue? Does spark cache the stream name somewhere? I also have checkpointing enabled as well. Thanks, Mike.

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-14 Thread Mike Trienis
with no success :( Would be curious to know if you got it working. Vadim On Apr 13, 2015, at 9:36 PM, Mike Trienis mike.trie...@orcsol.com wrote: Hi All, I am having trouble building a fat jar file through sbt-assembly. [warn] Merging 'META-INF/NOTICE.txt' with strategy 'rename' [warn

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-14 Thread Mike Trienis
-assembly-0.1-SNAPSHOT.jar Thanks again Richard! Cheers Mike. On Tue, Apr 14, 2015 at 11:01 AM, Richard Marscher rmarsc...@localytics.com wrote: Hi, I've gotten an application working with sbt-assembly and spark, thought I'd present an option. In my experience, trying to bundle any

sbt-assembly spark-streaming-kinesis-asl error

2015-04-13 Thread Mike Trienis
as a *provided* dependency? Also, is there a merge strategy I need to apply? Any help would be appreciated, Mike.
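A build.sbt sketch of both ideas (marking Spark itself as provided and adding a merge strategy for the META-INF collisions); the versions and strategy choices are assumptions:

    // build.sbt (sbt-assembly 0.13/0.14-era syntax)
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"                  % "1.3.0" % "provided",
      "org.apache.spark" %% "spark-streaming"             % "1.3.0" % "provided",
      "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"              // must be bundled
    )

    assemblyMergeStrategy in assembly := {
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard
      case _                             => MergeStrategy.first
    }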

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-13 Thread Mike Trienis
got it working. Vadim On Apr 13, 2015, at 9:36 PM, Mike Trienis mike.trie...@orcsol.com wrote: Hi All, I am having trouble building a fat jar file through sbt-assembly. [warn] Merging 'META-INF/NOTICE.txt' with strategy 'rename' [warn] Merging 'META-INF/NOTICE' with strategy 'rename

Re: MLlib : Gradient Boosted Trees classification confidence

2015-04-13 Thread mike
' Michael On Mon, Apr 13, 2015 at 10:13 AM, pprett [via Apache Spark User List] ml-node+s1001560n22470...@n3.nabble.com wrote: Hi Mike, Gradient Boosted Trees (or gradient boosted regression trees) don't store probabilities in each leaf node but rather model a continuous function which

Re: Cannot run unit test.

2015-04-08 Thread Mike Trienis
It's because your tests are running in parallel and you can only have one context running at a time. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-run-unit-test-tp14459p22429.html Sent from the Apache Spark User List mailing list archive at
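If the tests run under sbt, a sketch of the usual fix is to disable parallel test execution (and optionally fork) so only one SparkContext exists at a time:

    // build.sbt
    parallelExecution in Test := false
    fork in Test := true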

Re: Spark Streaming S3 Performance Implications

2015-04-01 Thread Mike Trienis
on a proper deployment, and will be sure to share my findings. Thanks, Mike! On Sat, Mar 21, 2015 at 8:09 AM, Chris Fregly ch...@fregly.com wrote: hey mike! you'll definitely want to increase your parallelism by adding more shards to the stream - as well as spinning up 1 receiver per shard

Spark Streaming S3 Performance Implications

2015-03-18 Thread Mike Trienis
, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file. Thanks, Mike.

Re: Writing to S3 and retrieving folder names

2015-03-05 Thread Mike Trienis
Please ignore my question, you can simply specify the root directory and it looks like redshift takes care of the rest. copy mobile from 's3://BUCKET_NAME/' credentials json 's3://BUCKET_NAME/jsonpaths.json' On Thu, Mar 5, 2015 at 3:33 PM, Mike Trienis mike.trie...@orcsol.com wrote: Hi

Writing to S3 and retrieving folder names

2015-03-05 Thread Mike Trienis
/docs/1.2.0/api/scala/index.html#org.apache.spark.streaming.dstream.DStream Anyone have any ideas? Thanks, Mike.

Pushing data from AWS Kinesis - Spark Streaming - AWS Redshift

2015-03-01 Thread Mike Trienis
it make more sense to push data from AWS Kinesis to AWS Redshift VIA another standalone approach such as the AWS Kinesis connectors. Thanks, Mike.
