Re: Memory problems with simple ETL in Pyspark

2017-04-15 Thread ayan guha
What I missed is: try increasing the number of partitions using repartition. On Sun, 16 Apr 2017 at 11:06 am, ayan guha <guha.a...@gmail.com> wrote: > It does not look like scala vs python thing. How big is your audience data > store? Can it be broadcasted? > > What is the me
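
A minimal PySpark sketch of the two suggestions above (repartition before the wide step, broadcast the small audience table); the paths, column names and partition count are illustrative, not from the thread:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical inputs: a large event table and a small audience lookup.
    events = spark.read.parquet("/data/events")        # large side
    audience = spark.read.parquet("/data/audience")    # small side

    # 1. Increase the number of partitions before the wide transformation.
    events = events.repartition(400, "id")

    # 2. Broadcast the small table so the join avoids shuffling the large side.
    joined = events.join(broadcast(audience), on="id", how="left")
    joined.write.mode("overwrite").parquet("/data/joined")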

Re: Memory problems with simple ETL in Pyspark

2017-04-15 Thread ayan guha
output=[id#0L, label#183, > collect_list_is#197]) >+- *Sort [id#0L ASC NULLS FIRST, label#183 ASC > NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#0L, > label#183, 200) > +- *Project [id#0L, indexedSegs#93, > cast(label_val#99 as double) AS label#183] > +- InMemoryTableScan [id#0L, > indexedSegs#93, label_val#99] > +- InMemoryRelation [id#0L, > label_val#99, segment#2, indexedSegs#93], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *Project [id#0L, > label#1 AS label_val#99, segment#2, UDF(segment#2) AS indexedSegs#93] >+- HiveTableScan > [id#0L, label#1, segment#2], MetastoreRelation pmccarthy, all_data_long > > > -- Best Regards, Ayan Guha

Re: checkpoint

2017-04-13 Thread ayan guha
d on more than 150 columns it replace ' ' by 0.0 > that all. > > regards > -- Best Regards, Ayan Guha

Re: is there a way to persist the lineages generated by spark?

2017-04-03 Thread ayan guha
> even possible with spark? > > Thanks, > kant > -- Best Regards, Ayan Guha

Re: Spark SQL 2.1 Complex SQL - Query Planning Issue

2017-03-30 Thread ayan guha
is delaying the execution by > 40 seconds. Is this due to time taken for Query Parsing and Query Planning > for the Complex SQL? If thats the case; how do we optimize this Query > Parsing and Query Planning time in Spark? Any help would be helpful. > > > Thanks > > Sathish > -- Best Regards, Ayan Guha

Re: Need help for RDD/DF transformation.

2017-03-30 Thread ayan guha
; 5, b > > > > Is there an efficient way to do this? > > Any help will be great. > > > > Thank you. > > > > - > > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: calculate diff of value and median in a group

2017-03-22 Thread ayan guha
Seq[Double])Double > > scala> ds.groupByKey(_._1).mapGroups((id, iter) => (id, > median(iter.map(_._2).toSeq))).show > +---+-+ > | _1| _2| > +---+-+ > |101|0.355| > |100| 0.43| > +---+-+ > > > Yong > > > > > -

Re: calculate diff of value and median in a group

2017-03-22 Thread ayan guha
name3,0.74,0.74,0.0 > 0,105,name1,0.41,0.41,0.0 > 0,105,name2,0.65,0.41,0.24 > 0,105,name3,0.29,0.41,0.24 > 1,100,name1,0.51,0.51,0.0 > 1,100,name2,0.51,0.51,0.0 > 1,100,name3,0.43,0.51,0.08 > 1,101,name1,0.59,0.59,0.0 > 1,101,name2,0.55,0.59,0.04 > 1,101,name3,0.84,0.59,0.25 > 1,102,name1,0.01,0.44,0.43 > 1,102,name2,0.98,0.44,0.54 > 1,102,name3,0.44,0.44,0.0 > 1,103,name1,0.47,0.16,0.31 > 1,103,name2,0.16,0.16,0.0 > 1,103,name3,0.02,0.16,0.14 > 1,104,name1,0.83,0.83,0.0 > 1,104,name2,0.89,0.83,0.06 > 1,104,name3,0.31,0.83,0.52 > 1,105,name1,0.59,0.59,0.0 > 1,105,name2,0.77,0.59,0.18 > 1,105,name3,0.45,0.59,0.14 > > Thanks for any help! > > Cheers, > Craig > > -- Best Regards, Ayan Guha

Re: Local spark context on an executor

2017-03-21 Thread ayan guha
> Thanks, > Shashank > > On Tue, Mar 21, 2017 at 3:37 PM, ayan guha <guha.a...@gmail.com> wrote: > >> What is your use case? I am sure there must be a better way to solve >> it >> >> On Wed, Mar 22, 2017 at 9:34 AM, Shashank Mandil < >> mandil

Re: Local spark context on an executor

2017-03-21 Thread ayan guha
tors on the hadoop > datanodes for processing. > > Is it possible for me to create a local spark context (master=local) on > these executors to be able to get a spark context ? > > Theoretically since each executor is a java process this should be doable > isn't it ? > > Thanks, > Shashank > -- Best Regards, Ayan Guha

Re: spark-sql use case beginner question

2017-03-08 Thread ayan guha
condition where complex filtering > > So please help what would be best approach and why i should not give > entire script for hivecontext to make its own rdds and run on spark if we > are able to do it > > coz all examples i see online are only showing hc.sql("select * from > table1) and nothing complex than that > > > -- Best Regards, Ayan Guha

Re: Failed to connect to master ...

2017-03-07 Thread ayan guha
an fix it (connect to SPARK in VM and > read form KAfKA in VM)? > > - Why using "local[1]" no exception is thrown and how to setup to read from > kafka in VM? > > *- How to stream from Kafka (data in the topic is in json format)? > * > Your input is appreciated! > > Best regards, > Mina > > > > -- Best Regards, Ayan Guha

Re: finding Spark Master

2017-03-07 Thread ayan guha
rying to figure out what goes in setMaster() aside from > local[*]. > > > > Adaryl "Bob" Wakefield, MBA > Principal > Mass Street Analytics, LLC > 913.938.6685 > > www.massstreet.net > > www.linkedin.com/in/bobwakefieldmba > Twitter: @BobLovesData > > > -- Best Regards, Ayan Guha

Re: Spark Beginner: Correct approach for use case

2017-03-05 Thread ayan guha
gt; So ideally I want to keep all the records in memory, but distributed over > the different nodes in the cluster. Does this mean sharing a SparkContext > between queries, or is this where HDFS comes in, or is there something else > that would be better suited? > > Or is there another overall approach I should look into for executing > queries in "real time" against a dataset this size? > > Thanks, > Allan. > > -- Best Regards, Ayan Guha

Re: [RDDs and Dataframes] Equivalent expressions for RDD API

2017-03-05 Thread ayan guha
uot;).as("count")).show() > > > > Thanks in advance > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-and-Dataframes-Equivalent-expressions-for-RDD-API-tp28455.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > > -- Best Regards, Ayan Guha

Re: Spark 2.0 issue with left_outer join

2017-03-04 Thread ayan guha
ata it fails. Test > case with small dataset of 50 records on local box runs fine. > > Thanks > Ankur > > Sent from my iPhone > > On Mar 4, 2017, at 12:09 PM, ayan guha <guha.a...@gmail.com> wrote: > > Just to be sure, can you reproduce the error using sql api? >

Re: Spark 2.0 issue with left_outer join

2017-03-04 Thread ayan guha
Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185) > at > org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > > Thanks > Ankur > > > -- Best Regards, Ayan Guha

Re: DataFrame from in memory datasets in multiple JVMs

2017-02-28 Thread ayan guha
y-datasets-in-multiple-JVMs-tp28438.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > ----- > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: SPark - YARN Cluster Mode

2017-02-27 Thread ayan guha
n use that instead. > > > On Sun, Feb 26, 2017 at 4:52 PM, ayan guha <guha.a...@gmail.com> wrote: > > Hi > > > > I am facing an issue with Cluster Mode, with pyspark > > > > Here is my code: > > > > conf = SparkConf() > >

Re: SPark - YARN Cluster Mode

2017-02-26 Thread ayan guha
, Feb 27, 2017 at 11:52 AM, ayan guha <guha.a...@gmail.com> wrote: > Hi > > I am facing an issue with Cluster Mode, with pyspark > > Here is my code: > > conf = SparkConf() > conf.setAppName("Spark Ingestion") > con

SPark - YARN Cluster Mode

2017-02-26 Thread ayan guha
_test.py I tried the same code with deploy-mode=client and all config are passing fine. Am I missing something? Will introducing --property-file be of any help? Can anybody share some working example? Best Ayan -- Best Regards, Ayan Guha

Re: Spark runs out of memory with small file

2017-02-26 Thread ayan guha
init" >>>> ...: for line in l: >>>> ...: if line[0:15] == 'WARC-Record-ID:': >>>> ...: the_id = line[15:] >>>> ...: d[the_id] = line >>>> ...: final.append(Row(**d)) >>>> ...: return final >>>> >>>> In [12]: rdd2 = rdd1.map(process_file) >>>> >>>> In [13]: rdd2.count() >>>> 17/02/25 19:03:03 ERROR YarnScheduler: Lost executor 5 on >>>> ip-172-31-41-89.us-west-2.compute.internal: Container killed by YARN >>>> for exceeding memory limits. 10.3 GB of 10.3 GB physical memory used. >>>> Consider boosting spark.yarn.executor.memoryOverhead. >>>> 17/02/25 19:03:03 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: >>>> Container killed by YARN for exceeding memory limits. 10.3 GB of 10.3 GB >>>> physical memory used. Consider boosting spark.yarn.executor.memoryOver >>>> head. >>>> 17/02/25 19:03:03 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID >>>> 5, ip-172-31-41-89.us-west-2.compute.internal, executor 5): >>>> ExecutorLostFailure (executor 5 exited caused by one of the running tasks) >>>> Reason: Container killed by YARN for exceeding memory limits. 10.3 GB of >>>> 10.3 GB physical memory used. Consider boosting >>>> spark.yarn.executor.memoryOverhead. >>>> >>>> >>>> -- >>>> Henry Tremblay >>>> Robert Half Technology >>>> >>>> >>>> - >>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >>>> >>>> >>> >>> -- >>> Henry Tremblay >>> Robert Half Technology >>> >>> >> > -- Best Regards, Ayan Guha

Re: CSV DStream to Hive

2017-02-21 Thread ayan guha
0.n3.nabble.com/CSV-DStream-to-Hive-tp28410.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread ayan guha
Then I ran the query and the driver got extremely high CPU percentage and > the process get stuck and I need to kill it. > I am running at client mode that submit to a Mesos cluster. > > I am using* Spark 2.0.2* and my data store in *HDFS* with *parquet format* > . > > Any advices for me i

Re: Basic Grouping Question

2017-02-20 Thread ayan guha
> > But when I group the data (groupBy-function), it returns a > RelationalDatasetGroup. On this I cannot apply the map and reduce function. > > I have the feeling that I am running in the wrong direction. Does anyone > know how to approach this? (I hope I explained it right, so it can be > understand :)) > > Regards, > Marco > -- Best Regards, Ayan Guha

Re: [Spark Streaming] Starting Spark Streaming application from a specific position in Kinesis stream

2017-02-19 Thread ayan guha
le >nervous that we would be faking Kinesis Client Library's protocol by >writing a checkpoint into Dynamo > > > Thanks in advance! > > Neil > -- Best Regards, Ayan Guha

Re: Spark Thrift Server - Skip header when load data from local file

2017-02-14 Thread ayan guha
w format delimited fields > terminated by ',' tblproperties("skip.header.line.count"="1"); > load data local inpath 'tabname.csv' overwrite into table tabname; > > How can i achieve this? Is there any other solution or workaround. > -- Best Regards, Ayan Guha

Re: Etl with spark

2017-02-12 Thread ayan guha
about 10 mins or so because I'm iteratively going down a > list of keys > > Is it possible to asynchronously do this? > > FYI I'm using spark.read.json to read from s3 because it infers my schema > > Regards > Sam > -- Best Regards, Ayan Guha

Re: Remove dependence on HDFS

2017-02-12 Thread ayan guha
I would like to > hear your thoughts. > > Cheers, > Ben > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Un-exploding / denormalizing Spark SQL help

2017-02-08 Thread ayan guha
gt;>> into something that looks like > >>> > >>> > >>> > +---++--+--+-+--+--+-+--+--+-+ > >>> |id |name|extra1|data1 |priority1|extra2|data2 |priority2|extra3|data3 > >>> |priority3| > >>> > >>> > +---++--+--+-+--+--+-+--+--+-+ > >>> |1 |Fred|8 |value1|1|8 |value8|2|8 > |value5|3 > >>> | > >>> |2 |Amy |9 |value3|1|9 |value5|2|null |null > >>> |null | > >>> > >>> > +---++--+--+-+--+--+-+--+--+-+ > >>> > >>> If I were going the other direction, I'd create a new column with an > >>> array of structs, each with 'extra', 'data', and 'priority' fields and > then > >>> explode it. > >>> > >>> Going from the more normalized view, though, I'm having a harder time. > >>> > >>> I want to group or partition by (id, name) and order by priority, but > >>> after that I can't figure out how to get multiple rows rotated into > one. > >>> > >>> Any ideas? > >>> > >>> Here's the code to create the input table above: > >>> > >>> import org.apache.spark.sql.Row > >>> import org.apache.spark.sql.Dataset > >>> import org.apache.spark.sql.types._ > >>> > >>> val rowsRDD = sc.parallelize(Seq( > >>> Row(1, "Fred", 8, "value1", 1), > >>> Row(1, "Fred", 8, "value8", 2), > >>> Row(1, "Fred", 8, "value5", 3), > >>> Row(2, "Amy", 9, "value3", 1), > >>> Row(2, "Amy", 9, "value5", 2))) > >>> > >>> val schema = StructType(Seq( > >>> StructField("id", IntegerType, nullable = true), > >>> StructField("name", StringType, nullable = true), > >>> StructField("extra", IntegerType, nullable = true), > >>> StructField("data", StringType, nullable = true), > >>> StructField("priority", IntegerType, nullable = true))) > >>> > >>> val data = sqlContext.createDataFrame(rowsRDD, schema) > >>> > >>> > >>> > > > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > > > > -- Best Regards, Ayan Guha

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-07 Thread ayan guha
create 6 executors, but i want to run all > this jobs on same spark application. > > How can achieve adding jobs to an existing spark application? > > I don't understand why SparkContext.getOrCreate() don't get existing > spark context. > > > Thanks, > > Cosmin P. > -- Best Regards, Ayan Guha

Re: specifing schema on dataframe

2017-02-06 Thread ayan guha
he > columns in the dataframe and try to match them with the StructType? > > I hope I cleared things up, What I wouldnt do for a drawing board right > now! > > > On Mon, Feb 6, 2017 at 8:56 AM, ayan guha <guha.a...@gmail.com> wrote: > >> UmmI think the

Re: specifing schema on dataframe

2017-02-06 Thread ayan guha
tically > > In your example you are specifying the numeric columns and I would like it > to be applied to any schema if that makes sense > On Mon, 6 Feb 2017 at 08:49, ayan guha <guha.a...@gmail.com> wrote: > >> SImple (pyspark) example: >> >>

Re: specifing schema on dataframe

2017-02-06 Thread ayan guha
de 2017 11:46 AM, "Sam Elamin" <hussam.ela...@gmail.com> >> escreveu: >> >> Hi All >> >> I would like to specify a schema when reading from a json but when trying >> to map a number to a Double it fails, I tried FloatType and IntType with no >> joy! >> >> >> When inferring the schema customer id is set to String, and I would like >> to cast it as Double >> >> so df1 is corrupted while df2 shows >> >> >> Also FYI I need this to be generic as I would like to apply it to any >> json, I specified the below schema as an example of the issue I am facing >> >> import org.apache.spark.sql.types.{BinaryType, StringType, StructField, >> DoubleType,FloatType, StructType, LongType,DecimalType} >> val testSchema = StructType(Array(StructField("customerid",DoubleType))) >> val df1 = >> spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}"""))) >> val df2 = >> spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}"""))) >> df1.show(1) >> df2.show(1) >> >> >> Any help would be appreciated, I am sure I am missing something obvious >> but for the life of me I cant tell what it is! >> >> >> Kind Regards >> Sam >> >> >> >> >> >> >> -- Best Regards, Ayan Guha

Re: filters Pushdown

2017-02-02 Thread ayan guha
view, copying, or distribution of this email (or any attachments thereto) > by others is strictly prohibited. If you are not the intended recipient, > please contact the sender immediately and permanently delete the original > and any copies of this email and any attachments thereto. > -- Best Regards, Ayan Guha

Re: Text

2017-01-27 Thread ayan guha
ve the same text with some corrected words. > > thanks! > > Soheila > > > -- Best Regards, Ayan Guha

Re: Kinesis streaming misunderstanding..?

2017-01-27 Thread ayan guha
DD counting. The >> streaming interval is 1 minute, during which time several shards have >> received data. Each minute interval, for this particular example, the >> driver prints out between 20 and 30 for the count value. I expected to see >> the count operation parallelized across the cluster. I think I must just be >> misunderstanding something fundamental! Can anyone point out where I'm >> going wrong? >> >> Yours in confusion, >> Graham >> >> > > > -- > --- > Takeshi Yamamuro > -- Best Regards, Ayan Guha

Re: Oracle JDBC - Spark SQL - Key Not Found: Scale

2017-01-26 Thread ayan guha
/blob/master/sql/core/ > src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala#L729 > Or, how about using Double or something instead of Numeric? > > // maropu > > On Fri, Jan 27, 2017 at 10:25 AM, ayan guha <guha.a...@gmail.com> wrote: > >> Okay, it is workin

Re: Oracle JDBC - Spark SQL - Key Not Found: Scale

2017-01-26 Thread ayan guha
Okay, it is working with varchar columns only. Is there any way to workaround this? On Fri, Jan 27, 2017 at 12:22 PM, ayan guha <guha.a...@gmail.com> wrote: > hi > > I thought so too, so I created a table with INT and Varchar columns > > desc agtes

Re: Oracle JDBC - Spark SQL - Key Not Found: Scale

2017-01-26 Thread ayan guha
.com> wrote: > Hi, > > I think you got this error because you used `NUMERIC` types in your schema > (https://github.com/apache/spark/blob/master/sql/core/ > src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala#L32). So, > IIUC avoiding the type is a workaround. > > //
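
One hedged way to apply that workaround from PySpark is to give the Oracle column an explicit precision and scale in the pushed-down query, so the JDBC reader never sees an unscaled NUMERIC; the connection details, table and column names below are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Casting inside the dbtable subquery gives the column a declared
    # precision/scale, which Spark can map to DecimalType(18, 4).
    query = "(select id, cast(amount as number(18, 4)) as amount from some_table) t"

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")   # placeholder
          .option("driver", "oracle.jdbc.OracleDriver")
          .option("dbtable", query)
          .option("user", "some_user")
          .option("password", "some_password")
          .load())
    df.printSchema()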

Oracle JDBC - Spark SQL - Key Not Found: Scale

2017-01-26 Thread ayan guha
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:209) at java.lang.Thread.run(Thread.java:745) -- Best Regards, Ayan Guha

Re: Spark SQL DataFrame to Kafka Topic

2017-01-24 Thread ayan guha
park.sql.SQLContext(sc);* > > *val df = sqlContext.read.parquet("/temp/*.parquet")* > > *df.registerTempTable("beacons")* > > > I want to directly ingest df DataFrame to Kafka ! Is there any way to > achieve this ?? > > > Cheers, > > Senthil > > > > > > -- Best Regards, Ayan Guha

Re: Will be in around 12:30pm due to some personal stuff

2017-01-19 Thread ayan guha
sonal stuff > <http://apache-spark-user-list.1001560.n3.nabble.com/Will-be-in-around-12-30pm-due-to-some-personal-stuff-tp28326.html> > Sent from the Apache Spark User List mailing list archive > <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com. > -- Best Regards, Ayan Guha

Re: Flume and Spark Streaming

2017-01-16 Thread ayan guha
rk > If Spark dies, I have no idea if Spark is going to reprocessing same > data again when it is sent again. > Coult it be different if I use Kafka Channel? > > > > > -- Best Regards, Ayan Guha

Re: Old version of Spark [v1.2.0]

2017-01-15 Thread ayan guha
> *Md. Rezaul Karim*, BSc, MSc > PhD Researcher, INSIGHT Centre for Data Analytics > National University of Ireland, Galway > IDA Business Park, Dangan, Galway, Ireland > Web: http://www.reza-analytics.eu/index.html > <http://139.59.184.114/index.html> > > On 15 Januar

Re: Old version of Spark [v1.2.0]

2017-01-15 Thread ayan guha
gt; PhD Researcher, INSIGHT Centre for Data Analytics > National University of Ireland, Galway > IDA Business Park, Dangan, Galway, Ireland > Web: http://www.reza-analytics.eu/index.html > <http://139.59.184.114/index.html> > -- Best Regards, Ayan Guha

Re: Can't load a RandomForestClassificationModel in Spark job

2017-01-12 Thread ayan guha
Hi, given that training and prediction are two different applications, I typically save model objects to HDFS and load them back during the prediction map stages. Best, Ayan On Fri, 13 Jan 2017 at 5:39 am, Sumona Routh wrote: > Hi all, > I've been working with Spark mllib 2.0.2
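
A small PySpark ML sketch of that pattern; the paths and the random forest estimator are illustrative, not taken from the thread:

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import (RandomForestClassifier,
                                            RandomForestClassificationModel)

    spark = SparkSession.builder.getOrCreate()

    # Training application: fit and persist the model.
    # The input is assumed to have 'features' (vector) and 'label' columns.
    train = spark.read.parquet("/data/train")
    model = RandomForestClassifier(labelCol="label", featuresCol="features").fit(train)
    model.write().overwrite().save("hdfs:///models/rf_v1")

    # Prediction application (a separate job): load the saved model and score.
    loaded = RandomForestClassificationModel.load("hdfs:///models/rf_v1")
    scored = loaded.transform(spark.read.parquet("/data/to_score"))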

Re: Add row IDs column to data frame

2017-01-12 Thread ayan guha
- Spark Consulting > -- > View this message in context: http://apache-spark-user-list. > 1001560.n3.nabble.com/Append-column-to-Data-Frame-or-RDD- > tp22385p28300.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: how to change datatype by useing StructType

2017-01-11 Thread ayan guha
Do you have year in your data? On Thu, 12 Jan 2017 at 5:24 pm, lk_spark wrote: > > > > > > > > > > > > > > > > > > > hi,all > > > I have a txt file ,and I want to process it as dataframe > > : > > > > > > data like this : > > >name1,30 > > >name2,18 > >

Re: Efficient look up in Key Pair RDD

2017-01-08 Thread ayan guha
taset 2.0 ? > > > > > > Thank you > > Anil Langote > > +1-425-633-9747 <+1%20425-633-9747> > > > > > > *From: *ayan guha <guha.a...@gmail.com> > *Date: *Sunday, January 8, 2017 at 10:32 PM > *To: *Anil Langote <anillangote0

Re: Efficient look up in Key Pair RDD

2017-01-08 Thread ayan guha
but in general spark might not be the > best tool for the job. Have you considered having Spark output to > something like memcache? > > What's the goal of you are trying to accomplish? > > On Sun, Jan 8, 2017 at 5:04 PM Anil Langote <anillangote0...@gmail.com> > wrote: > >> Hi All, >> >> I have a requirement where I wanted to build a distributed HashMap which >> holds 10M key value pairs and provides very efficient lookups for each key. >> I tried loading the file into JavaPairedRDD and tried calling lookup method >> its very slow. >> >> How can I achieve very very faster lookup by a given key? >> >> Thank you >> Anil Langote >> > -- Best Regards, Ayan Guha

Re: Approach: Incremental data load from HBASE

2017-01-06 Thread ayan guha
y which does > the job for me, or other alternative approaches can be done through reading > Hbase tables in RDD and saving RDD to Hive. > > Thanks. > > > On Thu, Jan 5, 2017 at 2:02 AM, ayan guha <guha.a...@gmail.com> wrote: > >> Hi Chetan >> >> What

Re: Approach: Incremental data load from HBASE

2017-01-04 Thread ayan guha
t; ingest it with Linkedin Gobblin to HDFS / S3. >>>>> >>>>> *Approach 2:* >>>>> >>>>> Run Scheduled Spark Job - Read from HBase and do transformations and >>>>> maintain flag column at HBase Level. >>>>> >>>>> In above both approach, I need to maintain column level flags. such as >>>>> 0 - by default, 1-sent,2-sent and acknowledged. So next time Producer will >>>>> take another 1000 rows of batch where flag is 0 or 1. >>>>> >>>>> I am looking for best practice approach with any distributed tool. >>>>> >>>>> Thanks. >>>>> >>>>> - Chetan Khatri >>>>> >>>> >>>> >>> >> > -- Best Regards, Ayan Guha

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread ayan guha
rned. This option applies only to reading. > > So my interpretation is all rows in the table are ingested, and this > "lowerBound" and "upperBound" is the span of each partition. Well, I am not > a native English speaker, maybe it means differently? > > Best regards, > Y

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread ayan guha
t specify any "start row" in the job, it will always > ingest the entire table. So I also cannot simulate a streaming process by > starting the job in fix intervals... > > Best regards, > Yang > > 2017-01-03 15:06 GMT+01:00 ayan guha <guha.a...@gmail.com>: > &g

Re: Broadcast Join and Inner Join giving different result on same DataFrame

2017-01-03 Thread ayan guha
in using simple inner join i get dataframe with joined >> values whereas if i do broadcast join i get empty dataframe with empty >> values. I am not able to explain this behavior. Ideally both should give >> the same result. >> >> What could have gone wrong. Any one faced the similar issue? >> >> >> Thanks, >> Prateek >> >> >> >> >> > > -- Best Regards, Ayan Guha

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread ayan guha
of reinventing the wheel. At this moment I have not discovered any > existing tool can parallelize ingestion tasks on one database. Is Sqoop a > proper candidate from your knowledge? > > Thank you again and have a nice day. > > Best regards, > Yang > > > > 2016-12-3

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_ piece0 of broadcast_1 in Spark 2.0.0"

2016-12-30 Thread ayan guha
k application > <http://stackoverflow.com/questions/41236661/failed-to-get-broadcast-1-piece0-of-broadcast-1-in-pyspark-application> > > > Failed to get broadcast_1_piece0 of broadcast_1 in pyspark application > I was building an application on Apache Spark 2.00 with Python 3.4 and > trying to load some CSV files from HDFS (... > > <http://stackoverflow.com/questions/41236661/failed-to-get-broadcast-1-piece0-of-broadcast-1-in-pyspark-application> > > > Could you please provide suggestion registering myself in Apache User list > or how can I get suggestion or support to debug the problem I am facing? > > Your response will be highly appreciated. > > > > Thanks & Best Regards, > Engr. Palash Gupta > WhatsApp/Viber: +8801817181502 > Skype: palash2494 > > > > > > -- Best Regards, Ayan Guha

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2016-12-29 Thread ayan guha
uring execution. >>> >>> Then I looked into Structured Streaming. It looks much more promising >>> because window operations based on event time are considered during >>> streaming, which could be the solution to my use case. However, from >>> documentation and code example I did not find anything related to streaming >>> data from a growing database. Is there anything I can read to achieve my >>> goal? >>> >>> Any suggestion is highly appreciated. Thank you very much and have a >>> nice day. >>> >>> Best regards, >>> Yang >>> - >>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >>> >>> >> > -- Best Regards, Ayan Guha

Re: Best way to process lookup ETL with Dataframes

2016-12-29 Thread ayan guha
h to implement an algorithm like this? Note also > that some fields I am implementing require multiple staged split/merge > steps due to cascading lookup joins. > > > Thanks, > > > *Michael Sesterhenn* > > > *msesterh...@cars.com <msesterh...@cars.com> * > > -- Best Regards, Ayan Guha

Merri Christmas and Happy New Year

2016-12-25 Thread ayan guha
Merri Christmas to all spark users. May your new year will be more "structured" with "streaming" success:)

Re: Has anyone managed to connect to Oracle via JDBC from Spark CDH 5.5.2

2016-12-21 Thread ayan guha
Try providing the correct driver name through the properties variable in the jdbc call. On Thu., 22 Dec. 2016 at 8:40 am, Mich Talebzadeh wrote: > This works with Spark 2 with Oracle jar file added to > > > > > > $SPARK_HOME/conf/spark-defaults.conf > > > > > > > > > > > > > >
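
For example (a hedged PySpark sketch; the URL, table and credentials are placeholders), the driver class can be passed through the properties argument of the jdbc reader:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The Oracle JDBC jar still has to be on the driver/executor classpath
    # (for example via --jars or spark-defaults.conf).
    df = spark.read.jdbc(
        url="jdbc:oracle:thin:@//dbhost:1521/ORCL",
        table="some_schema.some_table",
        properties={
            "user": "some_user",
            "password": "some_password",
            "driver": "oracle.jdbc.OracleDriver",   # explicit driver class name
        })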

Re: How to perform Join operation using JAVARDD

2016-12-19 Thread ayan guha
What's your desired output? On Sat., 17 Dec. 2016 at 9:50 pm, Sree Eedupuganti wrote: > I tried like this, > > *CrashData_1.csv:* > > *CRASH_KEYCRASH_NUMBER CRASH_DATECRASH_MONTH* > *2016899114 2016899114 01/02/2016 12:00:00 > AM

Re: How to get recent value in spark dataframe

2016-12-19 Thread ayan guha
You have two parts to it: 1. Do a subquery that, for each primary key, derives the latest value among flag=1 records. Ensure you get exactly 1 record per primary key value; here you can use rank() over (partition by primary key order by year desc). 2. Join your original dataset with the above on primary
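
A hedged Spark SQL sketch of those two steps; the table and column names (history, pk, value, year, flag) are invented for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical source: columns pk, value, year, flag.
    spark.read.parquet("/data/history").createOrReplaceTempView("history")

    # Step 1: latest flag=1 value per primary key.
    # (row_number() instead of rank() would guarantee a single row on ties.)
    spark.sql("""
        SELECT pk, value AS latest_value
        FROM (
            SELECT pk, value,
                   rank() OVER (PARTITION BY pk ORDER BY year DESC) AS rnk
            FROM history
            WHERE flag = 1
        ) t
        WHERE rnk = 1
    """).createOrReplaceTempView("latest")

    # Step 2: join the original dataset with the above on the primary key.
    result = spark.sql("""
        SELECT h.*, l.latest_value
        FROM history h
        LEFT JOIN latest l ON h.pk = l.pk
    """)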

Re: Wrting data from Spark streaming to AWS Redshift?

2016-12-09 Thread ayan guha
Stream. Please share your > experience and reference docs. > > Thanks > -- Best Regards, Ayan Guha

[no subject]

2016-12-06 Thread ayan guha
, but thought to check here as well if there are any solution available. -- Best Regards, Ayan Guha

Re: Access multiple cluster

2016-12-04 Thread ayan guha
ternal table in > the processing cluster to the analytics cluster. However, this has to be > supported by appropriate security configuration and might be less an > efficient then copying the data. > > On 4 Dec 2016, at 22:45, ayan guha <guha.a...@gmail.com> wrote: > > Hi > >

Access multiple cluster

2016-12-04 Thread ayan guha
Hi Is it possible to access hive tables sitting on multiple clusters in a single spark application? We have a data processing cluster and analytics cluster. I want to join a table from analytics cluster with another table in processing cluster and finally write back in analytics cluster. Best

Re: Spark sql generated dynamically

2016-12-02 Thread ayan guha
You are looking for window functions. On 2 Dec 2016 22:33, "Georg Heiler" wrote: > Hi, > > how can I perform a group wise operation in spark more elegant? Possibly > dynamically generate SQL? Or would you suggest a custom UADF? >
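
A minimal PySpark illustration of a group-wise operation done with a window function (the DataFrame and the aggregate are made up for the example):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [("a", 1.0), ("a", 3.0), ("b", 2.0)], ["grp", "val"])

    # Attach a per-group aggregate to every row without collapsing the rows.
    w = Window.partitionBy("grp")
    out = (df.withColumn("grp_avg", F.avg("val").over(w))
             .withColumn("diff_from_avg", F.col("val") - F.col("grp_avg")))
    out.show()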

Re: [structured streaming] How to remove outdated data when use Window Operations

2016-12-01 Thread ayan guha
Thanks TD. Will it be available in pyspark too? On 1 Dec 2016 19:55, "Tathagata Das" wrote: > In the meantime, if you are interested, you can read the design doc in the > corresponding JIRA - https://issues.apache.org/jira/browse/SPARK-18124 > > On Thu, Dec 1, 2016

Re: Spark Job not exited and shows running

2016-11-30 Thread ayan guha
Can you add sc.stop at the end of the code and try? On 1 Dec 2016 18:03, "Daniel van der Ende" wrote: > Hi, > > I've seen this a few times too. Usually it indicates that your driver > doesn't have enough resources to process the result. Sometimes increasing > driver

Re: time to run Spark SQL query

2016-11-28 Thread ayan guha
They should take same time if everything else is constant On 28 Nov 2016 23:41, "Hitesh Goyal" wrote: > Hi team, I am using spark SQL for accessing the amazon S3 bucket data. > > If I run a sql query by using normal SQL syntax like below > > 1) DataFrame

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-22 Thread ayan guha
t rack topology. > > > > -Yeshwanth > Can you Imagine what I would do if I could do all I can - Art of War > > On Tue, Nov 22, 2016 at 6:37 AM, ayan guha <guha.a...@gmail.com> wrote: > >> Because snappy is not splittable, so single task makes sense. >> >> Are s

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-22 Thread ayan guha
Because snappy is not splittable, a single task makes sense. Are you sure about the rack topology? I.e., is 225 in a different rack than 227 or 228? What does your topology file say? On 22 Nov 2016 10:14, "yeshwanth kumar" wrote: > Thanks for your reply, > > i can definitely

Re: dataframe data visualization

2016-11-20 Thread ayan guha
0:23 AM, wenli.o...@alibaba-inc.com < > wenli.o...@alibaba-inc.com> wrote: > >> Hi anyone, >> >> is there any easy way for me to do data visualization in spark using >> scala when data is in dataframe format? Thanks. >> >> Wayne Ouyang > > > -- Best Regards, Ayan Guha

Re: Flume integration

2016-11-20 Thread ayan guha
Hi, while I am following this discussion with interest, I am trying to comprehend any architectural benefit of a Spark sink. Is there any feature in Flume that makes it more suitable for ingesting stream data than Spark Streaming, so that we should chain them? For example, does it help durability or

Re: using StreamingKMeans

2016-11-19 Thread ayan guha
StramingKMeans. > > Suggestions ? > > regards. > > On Sun, 20 Nov 2016 at 3:51 AM, ayan guha <guha.a...@gmail.com> wrote: > >> Curious why do you want to train your models every 3 secs? >> On 20 Nov 2016 06:25, "Debasish Ghosh" <ghosh.debas...@gm

Re: using StreamingKMeans

2016-11-19 Thread ayan guha
Curious why do you want to train your models every 3 secs? On 20 Nov 2016 06:25, "Debasish Ghosh" wrote: > Thanks a lot for the response. > > Regarding the sampling part - yeah that's what I need to do if there's no > way of titrating the number of clusters online. > >

Re: Spark Partitioning Strategy with Parquet

2016-11-17 Thread ayan guha
Hi, I think you can use the map-reduce paradigm here. Create a key using user ID and date, with the record as the value. Then you can express your operation (the "do something" part) as a function. If the function meets certain criteria, such as being associative and commutative (say, add or multiplication), you can
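
A hedged pair-RDD sketch of that idea, keying by (user ID, date) and reducing with an associative, commutative function (addition); the sample records are invented:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Hypothetical records: (user_id, date, value)
    records = sc.parallelize([
        ("u1", "2016-11-17", 2.0),
        ("u1", "2016-11-17", 3.0),
        ("u2", "2016-11-17", 1.0),
    ])

    # Key by (user_id, date). Because addition is associative and commutative,
    # reduceByKey can combine partial results on each partition before shuffling.
    per_user_day = (records
                    .map(lambda r: ((r[0], r[1]), r[2]))
                    .reduceByKey(lambda a, b: a + b))
    print(per_user_day.collect())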

Re: Handling windows characters with Spark CSV on Linux

2016-11-17 Thread ayan guha
There is an utility called dos2unix. You can give it a try On 18 Nov 2016 00:20, "Jörn Franke" wrote: > > You can do the conversion of character set (is this the issue?) as part of your loading process in Spark. > As far as i know the spark CSV package is based on Hadoop

Re: use case reading files split per id

2016-11-14 Thread ayan guha
How about the following approach: - get the list of IDs - get one RDD for each of them using wholeTextFiles - map and flatMap to generate a pair RDD with ID as key and list as value - union all the RDDs together - group by key On 15 Nov 2016 16:43, "Mo Tao" wrote: > Hi ruben, > > You may
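
A rough PySpark sketch of that outline; the ID list, paths and line-based parsing are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    ids = ["id1", "id2", "id3"]   # placeholder list of IDs

    # One wholeTextFiles RDD per id: each element is (file_path, file_content).
    # flatMap turns that into (id, line) pairs; i=i pins the loop variable.
    rdds = [
        sc.wholeTextFiles("/data/split/{}/*".format(i))
          .flatMap(lambda kv, i=i: [(i, line) for line in kv[1].splitlines()])
        for i in ids
    ]

    # Union everything and collect all lines per id.
    grouped = sc.union(rdds).groupByKey().mapValues(list)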

Re: Grouping Set

2016-11-14 Thread ayan guha
And, run the same SQL in hive and post any difference. On 15 Nov 2016 07:48, "ayan guha" <guha.a...@gmail.com> wrote: > It should be A,yes. Can you please reproduce this with small data and > exact SQL? > On 15 Nov 2016 02:21, "Andrés Ivaldi" <iaiva...@g

Re: Grouping Set

2016-11-14 Thread ayan guha
It should be A,yes. Can you please reproduce this with small data and exact SQL? On 15 Nov 2016 02:21, "Andrés Ivaldi" wrote: > Hello, I'm tryin to use Grouping Set, but I dont know if it is a bug or > the correct behavior. > > Givven the above example > Select a,b,sum(c)

Re: Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread ayan guha
>>> could easily reproduce this issue on high load . > >>> > >>> Note: I even tried reduceByKeyAndWindow with duration of 5 seconds > >>> instead of reduceByKey Operation, But even that didn't > >>> help.uniqueCampaigns.reduceByKeyAndWindow((c1,c2)=>c1, > Durations.Seconds(5), > >>> Durations.Seconds(5)) > >>> > >>> I have even requested for help on Stackoverflow , But I haven't > received > >>> any solutions to this issue. > >>> > >>> Stack Overflow Link > >>> > >>> > >>> https://stackoverflow.com/questions/40559858/spark- > streaming-reducebykey-not-removing-duplicates-for-the-same-key-in-a-batch > >>> > >>> > >>> Thanks and Regards > >>> Dev > > > > > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Exception not failing Python applications (in yarn client mode) - SparkLauncher says app succeeded, where app actually has failed

2016-11-12 Thread ayan guha
It's a known issue. There was one jira which was marked resolved but I still faced it in 1.6. On 12 Nov 2016 14:33, "Elkhan Dadashov" wrote: > Hi, > > *Problem*: > Spark job fails, but RM page says the job succeeded, also > > appHandle = sparkLauncher.startApplication() >

Re: DataSet is not able to handle 50,000 columns to sum

2016-11-11 Thread ayan guha
You can explore grouping sets in SQL and write an aggregate function to do the array-wise sum. It will boil down to something like: Select attr1, attr2..., yourAgg(Val) From t Group by attr1, attr2... Grouping sets((attr1, attr2), (attr1)) On 12 Nov 2016 04:57, "Anil Langote"
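
A hedged Spark SQL sketch of the GROUPING SETS part; a plain sum stands in for the custom array-wise aggregate mentioned above, and the table is invented for the example:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    spark.createDataFrame(
        [("a1", "b1", 1.0), ("a1", "b2", 2.0), ("a2", "b1", 3.0)],
        ["attr1", "attr2", "val"]).createOrReplaceTempView("t")

    # GROUPING SETS produces both the (attr1, attr2) and the (attr1)
    # aggregations in a single pass over the data.
    spark.sql("""
        SELECT attr1, attr2, sum(val) AS total
        FROM t
        GROUP BY attr1, attr2
        GROUPING SETS ((attr1, attr2), (attr1))
    """).show()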

Re: importing data into hdfs/spark using Informatica ETL tool

2016-11-09 Thread ayan guha
Yes, it can be done and a standard practice. I would suggest a mixed approach: use Informatica to create files in hdfs and have hive staging tables as external tables on those directories. Then that point onwards use spark. Hth Ayan On 10 Nov 2016 04:00, "Mich Talebzadeh"

Re: Newbie question - Best way to bootstrap with Spark

2016-11-06 Thread ayan guha
p://apache-spark-user-list. > 1001560.n3.nabble.com/Newbie-question-Best-way-to- > bootstrap-with-Spark-tp28032.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe e

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread ayan guha
I would go for partition by option. It seems simple and yes, SQL inspired :) On 4 Nov 2016 00:59, "Rabin Banerjee" wrote: > Hi Koert & Robin , > > * Thanks ! *But if you go through the blog https://bzhangusc. >

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread ayan guha
d to Fit against new dataset > to produce new model" I said this in context of re-training the model. Is > it not correct? Isn't it part of re-training? > > Thanks > > On Tue, Nov 1, 2016 at 4:01 PM, ayan guha <guha.a...@gmail.com> wrote: > >> Hi >> &g

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread ayan guha
algorithm > (IDF, NaiveBayes) > > Thanks > > On Tue, Nov 1, 2016 at 5:45 AM, ayan guha <guha.a...@gmail.com> wrote: > >> I have come across similar situation recently and decided to run >> Training workflow less frequently than scoring workflow. >> >&

Re: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread ayan guha
There are options to specify external jars in the form of --jars, --driver-classpath etc depending on spark version and cluster manager.. Please see spark documents for configuration sections and/or run spark submit help to see available options. On 1 Nov 2016 23:13, "Jan Botorek"

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread ayan guha
I have come across similar situation recently and decided to run Training workflow less frequently than scoring workflow. In your use case I would imagine you will run IDF fit workflow once in say a week. It will produce a model object which will be saved. In scoring workflow, you will typically

Re: spark dataframe rolling window for user define operation

2016-10-29 Thread ayan guha
").rowsBetween(-20, +20) > > var dfWithAlternate = df.withColumn( "alter",*XYZ*(df("c2")).over(wSpec1)) > > > > Where XYZ function can be - +,-,+,- alternatively > > > > > > PS : I have posted the same question at http://stackoverflow.com/ > questions/40318010/spark-dataframe-rolling-window-user-define-operation > > > > Regards, > > Kiran > -- Best Regards, Ayan Guha

Re: HiveContext is Serialized?

2016-10-26 Thread ayan guha
In your use case, your dedf need not be a data frame. You could use sc.textFile().collect(). Even better, you can just read off a local file, as your file is very small, unless you are planning to use yarn cluster mode. On 26 Oct 2016 16:43, "Ajay Chander" wrote: > Sean,

Re: Spark 1.2

2016-10-25 Thread ayan guha
Thank you both. On Tue, Oct 25, 2016 at 11:30 PM, Sean Owen <so...@cloudera.com> wrote: > archive.apache.org will always have all the releases: > http://archive.apache.org/dist/spark/ > > On Tue, Oct 25, 2016 at 1:17 PM ayan guha <guha.a...@gmail.com> wrote: > >>

Spark 1.2

2016-10-25 Thread ayan guha
Just in case, anyone knows how I can download Spark 1.2? It is not showing up in Spark download page drop down -- Best Regards, Ayan Guha

Re: Can i display message on console when use spark on yarn?

2016-10-20 Thread ayan guha
= > 16/10/20 17:13:13 Thread-3 INFO org.apache.spark.util.ShutdownHookManager>SPK> > Shutdown hook called > 16/10/20 17:13:13 Thread-3 INFO org.apache.spark.util.ShutdownHookManager>SPK> > Deleting directory /data/home/spark_tmp/spark-5b9f434b-5837-46e6-9625- > c4656b86af9e > > Thanks. > > > -- Best Regards, Ayan Guha

Re: pyspark dataframe codes for lead lag to column

2016-10-20 Thread ayan guha
> > is there pyspark dataframe code for lead/lag on a column? > > > > the lead/lag columns are something like this (column value, its lag, its lead): > > 1, -1, 2 > > 2, 1, 3 > > 3, 2, 4 > > 4, 3, 5 > > 5, 4, -1
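
A minimal PySpark sketch that produces that kind of output, using the lag and lead window functions with -1 as the default where no neighbour exists; the single-column DataFrame is made up for the example:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([(1,), (2,), (3,), (4,), (5,)], ["value"])

    # lag/lead over an ordered window; -1 fills the missing first/last neighbour.
    w = Window.orderBy("value")
    out = (df.withColumn("lag", F.lag("value", 1, -1).over(w))
             .withColumn("lead", F.lead("value", 1, -1).over(w)))
    out.show()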

Re: [Spark][issue]Writing Hive Partitioned table

2016-10-19 Thread ayan guha
age or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > On 7 October 2016
