Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread ayan guha
Thank you for the clarification. That was my understanding too. However how to provide the upper bound as it changes for every call in real life. For example it is not required for sqoop. On Mon, 6 Nov 2017 at 8:20 am, Nicolas Paris wrote: > Le 05 nov. 2017 à 22:02, ayan guha écriv
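
A minimal PySpark sketch of one common workaround (an assumption on my part, not the thread's resolution): fetch the current min/max of the partition column with a small preliminary query and feed those values into the partitioned JDBC read. The URL, credentials, table and column names below are illustrative; the same pattern applies to a Hive JDBC endpoint.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dynamic-jdbc-bounds-sketch").getOrCreate()

    url = "jdbc:postgresql://dbhost:5432/warehouse"   # illustrative connection details
    props = {"user": "spark", "password": "secret"}

    # Step 1: a tiny query to discover the current bounds of the partition column.
    row = spark.read.jdbc(
        url, "(SELECT MIN(id) AS lo, MAX(id) AS hi FROM big_table) b",
        properties=props).collect()[0]

    # Step 2: use those bounds for the parallel, partitioned read.
    df = spark.read.jdbc(
        url, "big_table", column="id",
        lowerBound=row["lo"], upperBound=row["hi"] + 1,
        numPartitions=16, properties=props)
    df.show()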

Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread ayan guha
Yes, my thought exactly. Kindly let me know if you need any help to port in pyspark. On Mon, Nov 6, 2017 at 8:54 AM, Nicolas Paris wrote: > Le 05 nov. 2017 à 22:46, ayan guha écrivait : > > Thank you for the clarification. That was my understanding too. However > how to > >

Re: Json Parsing.

2017-12-06 Thread ayan guha
On Thu, 7 Dec 2017 at 11:37 am, ayan guha wrote: > You can use get_json function > > On Thu, 7 Dec 2017 at 10:39 am, satyajit vegesna < > satyajit.apas...@gmail.com> wrote: > >> Does spark support automatic detection of schema from a json string in a >> datafr

Re: Json Parsing.

2017-12-06 Thread ayan guha
advice or help would be appreciated. > > Regards, > Satyajit. > -- Best Regards, Ayan Guha

Re: JDBC to hive batch use case in spark

2017-12-09 Thread ayan guha
s works fine but when I ran the save >> call on hive context to write data, it throws the exception and it says the >> table or view does not exists even though the table is precreated in hive. >> >> Please help if anyone tried such scenario. >> >> Thanks >> > -- Best Regards, Ayan Guha

Re: Passing an array of more than 22 elements in a UDF

2017-12-22 Thread ayan guha
st me how to take > more than 22 parameters in an UDF? I mean, if I want to pass all the > parameters as an array of integers? > > Thanks, > Aakash. > -- Best Regards, Ayan Guha

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread ayan guha
;>> On Mon, Jan 15, 2018 at 11:41 PM, kant kodali >>>> wrote: >>>> >>>>> Hi All, >>>>> >>>>> I am wondering if HDFS can be a streaming source like Kafka in Spark >>>>> 2.2.0? For example can I have stream1 reading from Kafka and writing to >>>>> HDFS and stream2 to read from HDFS and write it back to Kakfa ? such that >>>>> stream2 will be pulling the latest updates written by stream1. >>>>> >>>>> Thanks! >>>>> >>>> >>>> >>> >> > -- Best Regards, Ayan Guha

mapGroupsWithState in Python

2018-01-28 Thread ayan guha
dated').cast('timestamp').alias('last_updated'), unpackedDF.jsonData) dedupDF = dataDF.dropDuplicates(subset=['id']).withWatermark('last_updated','24 hours') So it is not working. Any help is appreciated. -- Best Regards, Ayan Guha
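
A minimal PySpark sketch of one way this dedup is often written (an assumption, not the resolution of the thread): apply the watermark before dropDuplicates and include the event-time column in the dedup subset so Spark can eventually purge old state. The rate source below stands in for the real stream with id/last_updated/jsonData columns.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("streaming-dedup-sketch").getOrCreate()

    # Placeholder source standing in for the JSON stream in the thread.
    unpackedDF = (spark.readStream.format("rate").load()
                  .select(F.col("value").alias("id"),
                          F.col("timestamp").alias("last_updated"),
                          F.lit("{}").alias("jsonData")))

    # Watermark first, then dropDuplicates; keeping the event-time column in the
    # subset lets Spark bound the deduplication state.
    dedupDF = (unpackedDF
               .withWatermark("last_updated", "24 hours")
               .dropDuplicates(["id", "last_updated"]))

    query = dedupDF.writeStream.format("console").outputMode("append").start()
    query.awaitTermination()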

Re: mapGroupsWithState in Python

2018-01-30 Thread ayan guha
Any help would be much appreciated :) On Mon, Jan 29, 2018 at 6:25 PM, ayan guha wrote: > Hi > > I want to write something in Structured streaming: > > 1. I have a dataset which has 3 columns: id, last_update_timestamp, > attribute > 2. I am receiving the data through

Re: mapGroupsWithState in Python

2018-01-31 Thread ayan guha
tateful-stream-processing-in-structured-streaming > > > > > > > > On Tue, Jan 30, 2018 at 5:14 AM, ayan guha wrote: > >> Any help would be much appreciated :) >> >> On Mon, Jan 29, 2018 at 6:25 PM, ayan guha wrote: >> >>> Hi >>>

Re: Can spark handle this scenario?

2018-02-16 Thread ayan guha
** You do NOT need dataframes, I mean. On Sat, Feb 17, 2018 at 3:58 PM, ayan guha wrote: > Hi > > Couple of suggestions: > > 1. Do not use Dataset, use Dataframe in this scenario. There is no benefit > of dataset features here. Using Dataframe, you can write an arbitrary

Re: Can spark handle this scenario?

2018-02-16 Thread ayan guha
random > operation on each record. Is Spark good at handling such scenario? > > > 2. Regarding the compilation error, any fix? I did not find a satisfactory > solution online. > > > Thanks for help! > > > > > > -- Best Regards, Ayan Guha

Re: Pyspark Error: Unable to read a hive table with transactional property set as 'True'

2018-03-02 Thread ayan guha
`restr_trk_cls` string, > | > | `tst_hist_cd` string, > | > | `cret_ts` string, > | > | `ylw_grp_nbr` int, > | > | `geo_dfct_grp_nme` string, > | > | `supv_rollup_cd` string, > | > | `dfct_stat_cd` string, > | > | `lst_maint_id` string, > | > | `del_rsn_cd` string, > | > | `umt_prcs_user_id` string, > | > | `gdfct_vinsp_srestr` string, > | > | `gc_opr_init` string) > | > | CLUSTERED BY ( > | > | geo_car_nme) > | > | INTO 2 BUCKETS > | > | ROW FORMAT SERDE > | > | 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' >| > | STORED AS INPUTFORMAT > | > | 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' >| > | OUTPUTFORMAT > | > | 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' > | > | LOCATION > | > | 'hdfs://HADOOP02/apps/hive/warehouse/load_etl.db/trpt_ > geo_defect_prod_dec07_del_blank' | > | TBLPROPERTIES ( > | > | 'numFiles'='4', > | > | 'numRows'='0', > | > | 'rawDataSize'='0', > | > | 'totalSize'='2566942', > | > | 'transactional'='true', > | > | 'transient_lastDdlTime'='1518695199') >| > +--- > +--+ > > > Thanks, > D > -- Best Regards, Ayan Guha

Re: Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-18 Thread ayan guha
> table) based on some job condition(ex. Job1). There are multiple tables >> like this, so column and the table names are decided only run time. So I >> can't do type conversion explicitly when read from Hive. >> >> So is there any utility/api available in Spark to achieve this conversion >> issue? >> >> >> Thanks, >> Guru >> > -- Best Regards, Ayan Guha

Re: Need config params while doing rdd.foreach or map

2018-03-22 Thread ayan guha
nication in error, please notify > us immediately by responding to this email and then delete it from your > system. The firm is neither liable for the proper and complete transmission > of the information contained in this communication nor for any delay in its > receipt. > -- Best Regards, Ayan Guha

Re: Issue with map function in Spark 2.2.0

2018-04-11 Thread ayan guha
02) > at > org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at > org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28) > at > org.apache.spark.api.python.PythonRDD$$anonfun$1.apply(PythonRDD.scala:141) > at > org.apache.spark.api.python.PythonRDD$$anonfun$1.apply(PythonRDD.scala:141) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > ... 1 more > > > > Please help me in this . Thanks. Nandan Priyadarshi > > -- Best Regards, Ayan Guha

Re: How to bulk insert using spark streaming job

2018-04-19 Thread ayan guha
: > How to bulk insert using spark streaming job > > Sent from my iPhone > -- Best Regards, Ayan Guha

Re: How to read the schema of a partitioned dataframe without listing all the partitions ?

2018-04-27 Thread ayan guha
> read a single partition and give me the schema of that partition and > consider it as the schema of the whole dataframe" ? (I don't care about > schema merge, it's off by the way) > > Thanks. > Walid. > -- Best Regards, Ayan Guha

Re: Read or save specific blocks of a file

2018-05-03 Thread ayan guha
ntents of each block > (or of one block) into separate files? > > Thank you very much, > Thodoris > > > - > To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org > For additional commands, e-mail: user-h...@hadoop.apache.org > > > > -- Best Regards, Ayan Guha

Re: native-lzo library not available

2018-05-03 Thread ayan guha
til.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo >> lExecutor.java:615) >> at java.lang.Thread.run(Thread.java:745) >> Driver stacktrace: >> >> >> >> I see the LZO at GPextras: >> >> ll >> total 104 >> -rw-r--r-- 1 cloudera-scm cloudera-scm 35308 Oct 4 2017 >> COPYING.hadoop-lzo >> -rw-r--r-- 1 cloudera-scm cloudera-scm 62268 Oct 4 2017 >> hadoop-lzo-0.4.15-cdh5.13.0.jar >> lrwxrwxrwx 1 cloudera-scm cloudera-scm31 May 3 07:23 hadoop-lzo.jar >> -> hadoop-lzo-0.4.15-cdh5.13.0.jar >> drwxr-xr-x 2 cloudera-scm cloudera-scm 4096 Oct 4 2017 native >> >> >> >> >> -- >> Take Care >> Fawze Abujaber >> > > > > -- > Take Care > Fawze Abujaber > -- Best Regards, Ayan Guha

Re: Submit many spark applications

2018-05-16 Thread ayan guha
m are running. > Any > > suggestions on how to solve this problem? Thank you! > > > > -- > Marcelo > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: How to skip nonexistent file when read files with spark?

2018-05-21 Thread ayan guha
kip these nonexistent file >> path, and continues to run. I have tried python HDFSCli api to check the >> existence of path , but hdfs cli cannot support wildcard. >> >> >> >> Any good idea to solve my problem? Thanks~ >> >> >> >> Regard, >> Junfeng Chen >> > > -- Best Regards, Ayan Guha

Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread ayan guha
> - >>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >>> >>> >> > > -- > Thanks, > Ajay > -- Best Regards, Ayan Guha

Re: Append In-Place to S3

2018-06-03 Thread ayan guha
, >>>>> target, >>>>> target_id >>>>> FROM new_data i >>>>> LEFT ANTI JOIN existing_data im >>>>> ON i.source = im.source >>>>> AND i.source_id = im.source_id >>>>> AND i.target = im.target >>>>> AND i.target = im.target_id >>>>> """ >>>>> ) >>>>> append_df.coalesce(1).write.parquet(existing_data_path, mode='append', >>>>> compression='gzip’) >>>>> >>>>> >>>>> I thought this would append new rows and keep the data unique, but I >>>>> am see many duplicates. Can someone help me with this and tell me what I >>>>> am >>>>> doing wrong? >>>>> >>>>> Thanks, >>>>> Ben >>>>> >>>> -- Best Regards, Ayan Guha

Spark-Mongodb connector issue

2018-06-18 Thread ayan guha
-- Best Regards, Ayan Guha

Re: the best tool to interact with Spark

2018-06-26 Thread ayan guha
> Thank you, > Donni > -- Best Regards, Ayan Guha

Big Burst of Streaming Changes

2018-07-29 Thread ayan guha
tool, is there any spark package which can actually listen to Oracle change stream? So can we use spark as the CDC tool itself? -- Best Regards, Ayan Guha

Re: Drawing Big Data tech diagrams using Pen Tablets

2018-09-12 Thread ayan guha
;>>>> that is clumsy and expensive. >>>>> >>>>> I was wondering if anyone has tried some tools like Wacom Intuos Pro >>>>> Paper Edition Pen Tablet >>>>> <https://www.wacom.com/en/products/pen-tablets/wacom-intuos-pro-paper> >>>>> or any similar tools for easier drawing and their recommendation? >>>>> >>>>> Thanks, >>>>> >>>>> Dr Mich Talebzadeh >>>>> >>>>> >>>>> >>>>> LinkedIn * >>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >>>>> >>>>> >>>>> >>>>> http://talebzadehmich.wordpress.com >>>>> >>>>> >>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for >>>>> any loss, damage or destruction of data or any other property which may >>>>> arise from relying on this email's technical content is explicitly >>>>> disclaimed. The author will in no case be liable for any monetary damages >>>>> arising from such loss, damage or destruction. >>>>> >>>>> >>>>> >>>> -- Best Regards, Ayan Guha

Re: Time-Series Forecasting

2018-09-19 Thread ayan guha
2018, at 18:01, Mina Aslani wrote: >> > >> > Hi, >> > I have a question for you. Do we have any Time-Series Forecasting >> library in Spark? >> > >> > Best regards, >> > Mina >> > -- Best Regards, Ayan Guha

Re: Pyspark Partitioning

2018-09-30 Thread ayan guha
Point1 >> 2| id2| Point2 >> >> I want to have a partition for every Group_Id and apply on every >> partition a function defined by me. >> I have tried with partitionBy('Group_Id').mapPartitions() but i receive >> error. >> Could you please advice me how to do it? >> > -- Best Regards, Ayan Guha

SparkR issue

2018-10-08 Thread ayan guha
uld we speed up this bit? 2. We understand better options would be to read data from external sources, but we need this data to be generated for some simulation purpose. What's possibly going wrong? Best Ayan -- Best Regards, Ayan Guha

Re: Where does the Driver run?

2019-03-29 Thread ayan guha
with “my-app" > > val jars = listJars(“/path/to/lib") > > conf.setJars(jars) > > … > > > > When I launch the job I see 2 executors running on the 2 workers/slaves. > Everything seems to run fine and sometimes completes successfully. Frequent > failures are the reason for this question. > > > > Where is the Driver running? I don’t see it in the GUI, I see 2 Executors > taking all cluster resources. With a Yarn cluster I would expect the > “Driver" to run on/in the Yarn Master but I am using the Spark Standalone > Master, where is the Drive part of the Job running? > > > > If is is running in the Master, we are in trouble because I start the > Master on one of my 2 Workers sharing resources with one of the Executors. > Executor mem + driver mem is > available mem on a Worker. I can change this > but need so understand where the Driver part of the Spark Job runs. Is it > in the Spark Master, or inside and Executor, or ??? > > > > The “Driver” creates and broadcasts some large data structures so the need > for an answer is more critical than with more typical tiny Drivers. > > > > Thanks for you help! > > > > > -- > > Cheers! > > > > -- Best Regards, Ayan Guha

Re: How to retrieve multiple columns values (in one row) to variables in Spark Scala method

2019-04-05 Thread ayan guha
Try like this: val primitiveDS = spark.sql("select 1.2 avg ,2.3 stddev").collect().apply(0) val arr = Array(primitiveDS.getDecimal(0), primitiveDS.getDecimal(1)) primitiveDS: org.apache.spark.sql.Row = [1.2,2.3] arr: Array[java.math.BigDecimal] = Array(1.2, 2.3)

Re: Databricks - number of executors, shuffle.partitions etc

2019-05-15 Thread ayan guha
't take into account any spark.conf.set properties... it creates 8 >> worker nodes (dat executors) but doesn't honor the supplied conf >> parameters. Any idea? >> >> -- >> Regards, >> >> Rishi Shah >> > > > -- > Regards, > > Rishi Shah > -- Best Regards, Ayan Guha

Re: Announcing Delta Lake 0.2.0

2019-06-19 Thread ayan guha
>> >>> Best regards, >>> Liwen Sun >>> >>> -- >>> You received this message because you are subscribed to the Google >>> Groups "Delta Lake Users and Developers" group. >>> To unsubscribe from this group and stop receiving emails fr

Re: Announcing Delta Lake 0.2.0

2019-06-20 Thread ayan guha
Delta Lake to read and write >>> data on cloud storage services such as Amazon S3 and Azure Blob Storage. >>> For configuration instructions, please see: >>> https://docs.delta.io/0.2.0/delta-storage.html >>> >>> *Improved concurrency* >>> Delta Lake now allows concurrent append-only writes while still ensuring >>> serializability. For concurrency control in Delta Lake, please see: >>> https://docs.delta.io/0.2.0/delta-concurrency.html >>> >>> We have also greatly expanded the test coverage as part of this release. >>> >>> We would like to acknowledge all community members for contributing to >>> this release. >>> >>> Best regards, >>> Liwen Sun >>> >>> -- Best Regards, Ayan Guha

Re: Announcing Delta Lake 0.2.0

2019-06-21 Thread ayan guha
this workaround, other than during the time when the batch is running the table can provide some wrong information? Best Ayan On Fri, Jun 21, 2019 at 8:03 PM Tathagata Das wrote: > @ayan guha @Gourav Sengupta > > Delta Lake is OSS currently does not support defining tables in Hive

Re: Learning Spark

2019-07-04 Thread ayan guha
>> I am new Spark learner. Can someone guide me with the strategy towards >> getting expertise in PySpark. >> >> Thanks!!! >> > -- Best Regards, Ayan Guha

Re: Convert a line of String into column

2019-10-05 Thread ayan guha
>> val query = words. >> writeStream. >> outputMode("append"). >> format("console"). >> start >> query.awaitTermination() >> >> But in fact this code only turns the line into a single column >> >> +---+ >> | value| >> +---+ >> |col1...| >> |col2...| >> | col3..| >> | ... | >> | col6 | >> +--+ >> >> How to achieve the effect that I want to do? >> >> Thanks? >> >> -- Best Regards, Ayan Guha

Re: Driver vs master

2019-10-07 Thread ayan guha
allocated for the driver regardless of how many threads you have. >>> >>> >>> So if we will run more jobs then we need more memory on master. Please >>>> correct me if I am wrong. >>>> >>> >>> This depends on your application, but in general more threads will >>> require more memory. >>> >>> >>> >>>> >>>> Thanks >>>> Amit >>>> >>> -- >>> It's dark in this basement. >>> >> -- > It's dark in this basement. > -- Best Regards, Ayan Guha

Fwd: Delta with intelligent upsett

2019-10-31 Thread ayan guha
be an obvious performance booster. Any thoughts? -- Best Regards, Ayan Guha -- Best Regards, Ayan Guha

Re: Avro file question

2019-11-04 Thread ayan guha
> >> I’m assuming that a single large avro file can also be split into >> multiple mappers/reducers/executors during processing. >> >> Thanks. >> > -- Best Regards, Ayan Guha

Re: Explode/Flatten Map type Data Using Pyspark

2019-11-14 Thread ayan guha
ues).The number of key values > will be different for each event id.so i want to flatten the records for > all > the map type(key values) as below > > eve_id k1 k2 k3 > 001abc xyz 10091 > > eve_id, k1 k2 k3 k4 > 002, 12 jack 0.01 0998 > > eve_id, k1 k2k3 k4 k5 > 003, aaa device endpoint - > > > Thanks > Anbu > > > > -- > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Explode/Flatten Map type Data Using Pyspark

2019-11-14 Thread ayan guha
ack 0.01 0998 > > eve_id, k1 k2k3 k4 k5 > 003, aaa device endpoint - > > > > > -- > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ > > ----- > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Is there a merge API available for writing DataFrame

2019-11-15 Thread ayan guha
equivalent of Hive's MERGE INTO but maybe we can have a way of > writing (updating) only those rows present in the DF, with the rest of the > rows/data in the sink untouched. > > Sivaprasanna > -- Best Regards, Ayan Guha

Re: How more than one spark job can write to same partition in the parquet file

2019-12-11 Thread ayan guha
No, we faced problems with that setup. On Thu, 12 Dec 2019 at 11:14 am, Chetan Khatri wrote: > Hi Spark Users, > would that be possible to write to same partition to the parquet file > through concurrent two spark jobs with different spark session. > > thanks > -- Best Regards, Ayan Guha

Re: How more than one spark job can write to same partition in the parquet file

2019-12-11 Thread ayan guha
We partitioned data logically for 2 different jobs...in our use case based on geography... On Thu, 12 Dec 2019 at 3:39 pm, Chetan Khatri wrote: > Thanks, If you can share alternative change in design. I would love to > hear from you. > > On Wed, Dec 11, 2019 at 9:34 PM ayan guha wr

Re: Identify bottleneck

2019-12-19 Thread ayan guha
zation overload > either). > I also noticed that memory storage is never used during the execution. I > know from several hours of research that bz2 is the only real compression > algorithm usable as an input in spark for parallelization reasons. > > Do you have any idea of why such a behaviour ? > and do you have any idea on how to improve such treatment ? > > Cheers > > Antoine > > > -- Best Regards, Ayan Guha

Re: Identify bottleneck

2019-12-20 Thread ayan guha
t isn't any better. The logic all gets processed by > the same engine so to confirm, compare the DAGs generated from both > approaches and see if they're identical. > > On Fri, 20 Dec 2019, 8:56 am ayan guha, wrote: > >> Quick question: Why is it better to use one sql

Re: Performance tuning on the Databricks pyspark 2.4.4

2020-01-21 Thread ayan guha
can set this as per your > intended > parallelism and your available resources. > > > > >start = time.time() > pool = multiprocessing.Pool(20) > # This will execute get_counts() parallel, on each element inside > input_paths. > # result (a list of dictionary) is constructed when all executions are > completed. > //result = pool.map(get_counts, input_paths) > > end = time.time() > > result_df = pd.DataFrame(result) > # You can use, result_df.to_csv() to store the results in a csv. > print(result_df) > print('Time take : {}'.format(end - start)) > > > > -- > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Optimising multiple hive table join and query in spark

2020-03-15 Thread ayan guha
ashParittioning + >> shuffle during join. All the table join contain `Country` column with some >> extra column based on the table. >> >> Is there any way to avoid these shuffles ? and improve performance ? >> >> >> Thanks and regards >> Manjunath >> > -- Best Regards, Ayan Guha

Issue with UDF Int Conversion - Str to Int

2020-03-22 Thread ayan guha
+-------------+----------+-----------+
|value_to_hash|1         |2          |
+-------------+----------+-----------+
|4104003141   |478797741 |478797741  |
|4102859263   |2520346415|-1774620881|
+-------------+----------+-----------+
The function works fine, as shown in the print statement. However, the values are not matching and vary widely. Any pointer? -- Best Regards, Ayan Guha

Re: Issue with UDF Int Conversion - Str to Int

2020-03-23 Thread ayan guha
" * >> >> 2520346415 +-+--+---+ |value_to_hash|1 |2 | >> +-+--+---+ |4104003141 |478797741 |478797741 | >> |4102859263 >> |2520346415|-1774620881| +-+--+---+ >> >> The function working fine, as shown in the print statement. However >> values are not matching and vary widely. >> >> Any pointer? >> >> -- >> Best Regards, >> Ayan Guha >> > -- Best Regards, Ayan Guha

Re: Issue with UDF Int Conversion - Str to Int

2020-03-23 Thread ayan guha
ery plan: > > >>> df.select(conv(substring(sha1(col("value_to_hash")), 33, 8), 16, > 10).cast("long").alias("sha2long")).explain() > == Physical Plan == > Union > :- *(1) Project [478797741 AS sha2long#74L] > : +- Scan OneRowRelation[]
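
For reference, a small PySpark sketch of the expression shown in that plan (illustrative data). The mismatch in the earlier table is consistent with a UDF declared as IntegerType overflowing 32 bits (2520346415 wraps to -1774620881); doing the hash with built-in functions and casting to long avoids that.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("hash-to-long-sketch").getOrCreate()

    df = spark.createDataFrame([("4104003141",), ("4102859263",)], ["value_to_hash"])

    # sha1 -> last 8 hex chars -> base-16 to base-10 -> long, all on the SQL side.
    hashed = df.select(
        F.conv(F.substring(F.sha1(F.col("value_to_hash")), 33, 8), 16, 10)
         .cast("long").alias("sha2long"))
    hashed.show()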

Re: OFF TOPIC LIST CRITERIA

2020-03-28 Thread ayan guha
software. >> > >> > He has not reported any bugs while I have reported so many in such a >> short space of time. >> > He has warned me as well >> > >> > So that Sean Owen does not put a barrier in place for me in my path to >> free learning and free Apache software >> > I would like somebody to clarify the criteria for me. >> > >> > >> > Backbutton.co.uk >> > ¯\_(ツ)_/¯ >> > ♡۶Java♡۶RMI ♡۶ >> > Make Use Method {MUM} >> > makeuse.org >> > -- Best Regards, Ayan Guha

Re: How to pass a constant value to a partitioned hive table in spark

2020-04-16 Thread ayan guha
gt; | """ >> > sqltext: String = >> > $INSERT INTO TABLE michtest.BroadcastStaging PARTITION (broadcastId = >> > $broadcastValue, brand = "dummy") >> > SELECT >> > ocis_party_id AS partyId >> > , target_mobile_no AS phoneNumber >> > FROM tmp >> > >> > >> > scala> spark.sql($sqltext) >> > :41: error: not found: value $sqltext >> > spark.sql($sqltext) >> > >> > >> > Any ideas? >> > >> > >> > Thanks >> > >> > >> > Dr Mich Talebzadeh >> > >> > >> > >> > LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> > < >> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> >* >> > >> > >> > >> > http://talebzadehmich.wordpress.com >> > >> > >> > *Disclaimer:* Use it at your own risk. Any and all responsibility for >> any >> > loss, damage or destruction of data or any other property which may >> arise >> > from relying on this email's technical content is explicitly disclaimed. >> > The author will in no case be liable for any monetary damages arising >> from >> > such loss, damage or destruction. >> > -- Best Regards, Ayan Guha

Re: Spark :- Update record in partition.

2020-06-07 Thread ayan guha
date the row and overwrite the partition? > > Is there a way to only update 1 row like DBMS. Otherwise 1 row update > takes a long time to rewrite the whole partition ? > > Thanks > Sunil > > > > > -- Best Regards, Ayan Guha

Spark 1.3.1 - SQL Issues

2015-04-20 Thread ayan guha
op2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\mllib\recommendation.py", line 127, in _prepare assert isinstance(ratings, RDD), "ratings should be RDD" AssertionError: ratings should be RDD -- Best Regards, Ayan Guha

Re: SparkSQL performance

2015-04-20 Thread ayan guha
SparkSQL optimizes better by column pruning and predicate pushdown, primarily. Here you are not taking advantage of either. I am curious to know what goes into your filter function, as you are not using a filter on the SQL side. Best Ayan On 21 Apr 2015 08:05, "Renato Marroquín Mogrovejo" < renatoj.mar
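
A small PySpark sketch of the point above (path and column names are illustrative): expressing the projection and predicate in the SQL/DataFrame API lets Catalyst prune columns and push the filter down to the source, whereas an opaque Python filter function forces a full scan.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pushdown-sketch").getOrCreate()

    df = spark.read.parquet("/tmp/events")          # hypothetical Parquet dataset

    # Pruning + pushdown possible: only user_id/amount are read, filter goes to the scan.
    fast = df.select("user_id", "amount").where("amount > 100")

    # Opaque Python lambda: every column of every row is deserialised before filtering.
    slow = df.rdd.filter(lambda row: row["amount"] > 100)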

Re: Updating a Column in a DataFrame

2015-04-20 Thread ayan guha
You can always create another DF using a map. In reality, operations are lazy, so only the final value will get computed. Can you provide the use case in a little more detail? On 21 Apr 2015 08:39, "ARose" wrote: > In my Java application, I want to update the values of a Column in a given > DataFrame. Ho

Re: Spark SQL vs map reduce tableInputOutput

2015-04-20 Thread ayan guha
I think the recommended use will be creating a DataFrame using HBase as the source. Then you can run any SQL on that DF. In 1.2 you can create a base RDD and then apply the schema in the same manner. On 21 Apr 2015 03:12, "Jeetendra Gangele" wrote: > Thanks for reply. > > Does phoenix using inside Spark will

Re: invalidate caching for hadoopFile input?

2015-04-20 Thread ayan guha
You can use rdd.unpersist(). It is documented on the Spark programming guide page under the Removing Data section. Ayan On 21 Apr 2015 13:16, "Wei Wei" wrote: > Hey folks, > > I am trying to load a directory of avro files like this in spark-shell: > > val data = sqlContext.avroFile("hdfs://path/to/dir/*").c
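
A minimal sketch of the unpersist call (illustrative data standing in for the avro/hadoopFile input):

    from pyspark import SparkContext

    sc = SparkContext(appName="unpersist-sketch")

    data = sc.parallelize(range(1000000))
    data.cache()
    data.count()        # materialises the cached partitions

    # When the files underneath have changed, drop the cached copy and re-read.
    data.unpersist()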

Re: Map-Side Join in Spark

2015-04-20 Thread ayan guha
In my understanding you need to create a key out of the data and repartition both datasets to achieve a map-side join. On 21 Apr 2015 14:10, "ÐΞ€ρ@Ҝ (๏̯͡๏)" wrote: > Can someone share their working code of Map Side join in Spark + Scala. > (No Spark-SQL) > > The only resource i could find was this

Re: Map-Side Join in Spark

2015-04-20 Thread ayan guha
If you are using a pair RDD, then you can use the partitionBy method to provide your partitioner. On 21 Apr 2015 15:04, "ÐΞ€ρ@Ҝ (๏̯͡๏)" wrote: > What is re-partition ? > > On Tue, Apr 21, 2015 at 10:23 AM, ayan guha wrote: > >> In my understanding you need to crea
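
A small sketch of the pair-RDD approach described above (illustrative data): hash-partition both sides with the same partitioner up front, so the subsequent join does not have to re-shuffle either of them.

    from pyspark import SparkContext

    sc = SparkContext(appName="copartitioned-join-sketch")

    left = sc.parallelize([(i, "L%d" % i) for i in range(1000)])
    right = sc.parallelize([(i, "R%d" % i) for i in range(1000)])

    # Same partitioner and partition count on both sides; cache so the
    # partitioning work is done only once.
    left_p = left.partitionBy(8).cache()
    right_p = right.partitionBy(8).cache()

    joined = left_p.join(right_p)      # co-partitioned, so no extra shuffle
    print(joined.count())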

Re: Column renaming after DataFrame.groupBy

2015-04-21 Thread ayan guha
ataFrame.groupBy > <http://apache-spark-user-list.1001560.n3.nabble.com/Column-renaming-after-DataFrame-groupBy-tp22586.html> > Sent from the Apache Spark User List mailing list archive > <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com. > -- Best Regards, Ayan Guha

Spark 1.3.1 Dataframe breaking ALS.train?

2015-04-21 Thread ayan guha
ng fine in 1.2.0 (till last night :)) Any solution? I am thinking to map the training dataframe back to an RDD, but will lose the schema information. Best Ayan On Mon, Apr 20, 2015 at 10:23 PM, ayan guha wrote: > Hi > Just upgraded to Spark 1.3.1. > > I am getting an warning >

Re: When the old data dropped from the cache?

2015-04-21 Thread ayan guha
t;> data partitions in a least-recently-used (LRU) fashion." > > > Can it be understood that the cache will be automatically refreshed with > new data. If yes when and how? How Spark determines the old data? > > Regards. > -- Best Regards, Ayan Guha

Re: Map-Side Join in Spark

2015-04-21 Thread ayan guha
> Can you elaborate ? > > On Tue, Apr 21, 2015 at 11:18 AM, ayan guha wrote: > >> If you are using a pairrdd, then you can use partition by method to >> provide your partitioner >> On 21 Apr 2015 15:04, "ÐΞ€ρ@Ҝ (๏̯͡๏)" wrote: >> >>> What is re-pa

Re: Join on DataFrames from the same source (Pyspark)

2015-04-21 Thread ayan guha
s, > however it does have n*n rows. > > That does not happen, when I load df_one and df_two from disk directly. I > am on Spark 1.3.0, but this also happens on the current 1.4.0 snapshot. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Custom Partitioning Spark

2015-04-21 Thread ayan guha
bscribe, e-mail: user-unsubscr...@spark.apache.org >>> For additional commands, e-mail: user-h...@spark.apache.org >>> >>> >> > > > -- > Regards, > Muhammad Aamir > > > *CONFIDENTIALITY:This email is intended solely for the person(s) named and > may be confidential and/or privileged.If you are not the intended > recipient,please delete it,notify me and do not copy,use,or disclose its > content.* > -- Best Regards, Ayan Guha

Re: Join on DataFrames from the same source (Pyspark)

2015-04-21 Thread ayan guha
ctually was > > df_one = df.select('col1', 'col2') > df_two = df.select('col1', 'col3') > > But in Spark 1.4.0 this does not seem to make any difference anyway and > the problem is the same with both versions. > > >

Re: Spark 1.3.1 Dataframe breaking ALS.train?

2015-04-21 Thread ayan guha
test/api/scala/index.html#org.apache.spark.ml.recommendation.ALS > > > > In the examples/ directory for ml/, you can find a MovieLensALS example. > > > > Good luck! > > Joseph > > > > On Tue, Apr 21, 2015 at 4:58 AM, ayan guha wrote: >>

Re: Convert DStream to DataFrame

2015-04-22 Thread ayan guha
What about sqlContext.createDataFrame(rdd)? On 22 Apr 2015 23:04, "Sergio Jiménez Barrio" wrote: > Hi, > > I am using Kafka with Apache Stream to send JSON to Apache Spark: > > val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, > StringDecoder](ssc, kafkaParams, topicsSe
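
A minimal PySpark sketch of that idea (older SQLContext-style API; the socket source stands in for the Kafka direct stream in the thread): convert each micro-batch RDD of JSON strings into a DataFrame inside foreachRDD.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="dstream-to-dataframe-sketch")
    sqlContext = SQLContext(sc)
    ssc = StreamingContext(sc, 5)

    lines = ssc.socketTextStream("localhost", 9999)   # stand-in for the Kafka stream

    def process(time, rdd):
        if not rdd.isEmpty():
            # read.json infers a schema from an RDD of JSON strings;
            # createDataFrame(rowRDD, schema) works when the schema is known up front.
            df = sqlContext.read.json(rdd)
            df.show()

    lines.foreachRDD(process)
    ssc.start()
    ssc.awaitTermination()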

Re: Pipeline in pyspark

2015-04-23 Thread ayan guha
I do not think you can share data across spark contexts. So as long as you can pass it around you should be good. On 23 Apr 2015 17:12, "Suraj Shetiya" wrote: > Hi, > > I have come across ways of building pipeline of input/transform and output > pipelines with Java (Google Dataflow/Spark etc). I

Re: Spark SQL performance issue.

2015-04-23 Thread ayan guha
Quick questions: why are you caching both the RDD and the table? Which stage of the job is slow? On 23 Apr 2015 17:12, "Nikolay Tikhonov" wrote: > Hi, > I have Spark SQL performance issue. My code contains a simple JavaBean: > > public class Person implements Externalizable { > private int id; >

Re: what is the best way to transfer data from RDBMS to spark?

2015-04-24 Thread ayan guha
What is the specific use case? I can think of a couple of ways (write to HDFS and then read from Spark, or stream data to Spark). Also I have seen people using MySQL jars to bring data in. Essentially you want to simulate the creation of an RDD. On 24 Apr 2015 18:15, "sequoiadb" wrote: > If I run spark in s

Re: Question regarding join with multiple columns with pyspark

2015-04-24 Thread ayan guha
x27;).toPandas() > > > > > > *Output > > > > Pandas > > month value year > > 0 5100 1993 > > 112200 2005 > > 212300 1994 > > > > month value year > > 012101 1993 > > 112102 1993 > > > > Empty DataFrame > > > > Columns: [month, value_x, year, value_y] > > > > Index: [] > > > > Spark > > month value year > > 0 5100 1993 > > 112200 2005 > > 212300 1994 > > > > month value year > > 012101 1993 > > 112102 1993 > > > > month value year month value year > > 012200 200512102 1993 > > 112200 200512101 1993 > > 212300 199412102 1993 > > 312300 199412101 1993 > > > > It looks like Spark returns some results where an inner join should > > return nothing. > > > > Am I doing the join with two columns in the wrong way? If yes, what is > > the right syntax for this? > > > > Thanks! > > Ali > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Question regarding join with multiple columns with pyspark

2015-04-24 Thread ayan guha
I just tested your pr On 25 Apr 2015 10:18, "Ali Bajwa" wrote: > Any ideas on this? Any sample code to join 2 data frames on two columns? > > Thanks > Ali > > On Apr 23, 2015, at 1:05 PM, Ali Bajwa wrote: > > > Hi experts, > > > > Sorry if this is a n00b question or has already been answered...
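
For reference, a short PySpark sketch of joining on two columns (illustrative data mirroring the year/month example): use an explicit compound condition with aliases, or in newer releases pass a list of column names.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("multi-column-join-sketch").getOrCreate()

    pdf1 = spark.createDataFrame([(1993, 5, 100), (2005, 1, 200), (1994, 1, 300)],
                                 ["year", "month", "value"])
    pdf2 = spark.createDataFrame([(1993, 1, 101), (1993, 1, 102)],
                                 ["year", "month", "value"])

    a, b = pdf1.alias("a"), pdf2.alias("b")

    # Explicit compound condition; the aliases keep the column references unambiguous.
    joined = a.join(b, (a.year == b.year) & (a.month == b.month), "inner")
    joined.show()

    # Newer releases also accept a list of join columns:
    # pdf1.join(pdf2, ["year", "month"], "inner")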

Re: Customized Aggregation Query on Spark SQL

2015-04-24 Thread ayan guha
y((a, b) => if (a.age > b.age) a else b) > > Thank you! > > Best, > Wenlei > -- Best Regards, Ayan Guha

Re: Customized Aggregation Query on Spark SQL

2015-04-24 Thread ayan guha
> C10 C10 > C20 C20 > > The desired output would be > Name Age Other > A 30 A30 > B 15 B15 > C 20 C20 > > Thank you so much for

Re: what is the best way to transfer data from RDBMS to spark?

2015-04-25 Thread ayan guha
(Note that this is different than the Spark SQL JDBC server, which allows other applications to run queries using Spark SQL). On Fri, Apr 24, 2015 at 6:27 PM, ayan guha wrote: > What is the specific usecase? I can think of couple of ways (write to hdfs > and then read from spark or stream d
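
For reference, a sketch of the JDBC data source route mentioned above (newer DataFrameReader form; URL, credentials and table names are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("jdbc-read-sketch").getOrCreate()

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://dbhost:3306/sales")
          .option("dbtable", "orders")
          .option("user", "spark")
          .option("password", "secret")
          # Optional: split the read into parallel partitions.
          .option("partitionColumn", "order_id")
          .option("lowerBound", "1")
          .option("upperBound", "1000000")
          .option("numPartitions", "8")
          .load())
    df.show()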

Re: Spark SQL 1.3.1: java.lang.ClassCastException is thrown

2015-04-25 Thread ayan guha
scribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

directory loader in windows

2015-04-25 Thread ayan guha
run(GatewayConnection.java:207) at java.lang.Thread.run(Unknown Source) -- Best Regards, Ayan Guha

Re: directory loader in windows

2015-04-25 Thread ayan guha
sY = sc.textFile(loc) > print newsY.count() > > On 25 April 2015 at 20:08, ayan guha wrote: > >> Hi >> >> I am facing this weird issue. >> >> I am on Windows, and I am trying to load all files within a folder. Here >> is my code - >> >&g

Re: Querying Cluster State

2015-04-26 Thread ayan guha
ters and Workers that are currently available using API calls and > then take some appropriate action based on the information I get back, like > restart a dead Master or Worker. > > Is this possible? does Spark provide such API? > -- Best Regards, Ayan Guha

Re: Querying Cluster State

2015-04-26 Thread ayan guha
gt; > Regards > > > On Sun, Apr 26, 2015 at 10:12 AM, ayan guha wrote: > >> In my limited understanding, there must be single "leader" master in >> the cluster. If there are multiple leaders, it will lead to unstable >> cluster as each masters will

Re: Question on Spark SQL performance of Range Queries on Large Datasets

2015-04-27 Thread ayan guha
The answer is: it depends :) The fact that query runtime increases indicates more shuffle. You may want to construct RDDs based on the keys you use. You may want to specify what kind of node you are using and how many executors you are using. You may also want to play around with executor memory alloc

Re: Scalability of group by

2015-04-27 Thread ayan guha
Hi, can you test on a smaller dataset to identify whether it is a cluster issue or a scaling issue in Spark? On 28 Apr 2015 11:30, "Ulanov, Alexander" wrote: > Hi, > > > > I am running a group by on a dataset of 2B of RDD[Row [id, time, value]] > in Spark 1.3 as follows: > > “select id, time, first(value)

Re: Automatic Cache in SparkSQL

2015-04-27 Thread ayan guha
Spark keeps job data in memory by default for the kind of performance gains you are seeing. Additionally, depending on your query, Spark runs stages, and at any point in time Spark's code behind the scenes may issue an explicit cache. If you hit any such scenario you will find those cached objects in the UI under Storage

Re: New JIRA - [SQL] Can't remove columns from DataFrame or save DataFrame from a join due to duplicate columns

2015-04-28 Thread ayan guha
The alias function is not in Python yet. I suggest writing SQL if your data suits it. On 28 Apr 2015 14:42, "Don Drake" wrote: > https://issues.apache.org/jira/browse/SPARK-7182 > > Can anyone suggest a workaround for the above issue? > > Thanks. > > -Don > > -- > Donald Drake > Drake Consulting > http:
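
A minimal sketch of the SQL route (illustrative column names): register the frames as temp tables and qualify/rename the columns in the query, which sidesteps the duplicate-column problem after the join.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="sql-duplicate-columns-sketch")
    sqlContext = SQLContext(sc)

    df1 = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df2 = sqlContext.createDataFrame([(1, "x"), (2, "y")], ["id", "value"])

    df1.registerTempTable("t1")   # createOrReplaceTempView in later releases
    df2.registerTempTable("t2")

    # Qualified, renamed columns: the result has no duplicate names to drop.
    result = sqlContext.sql("""
        SELECT t1.id, t1.value AS value_left, t2.value AS value_right
        FROM t1 JOIN t2 ON t1.id = t2.id
    """)
    result.show()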

Re: Understanding Spark's caching

2015-04-28 Thread ayan guha
Hi, I replied to you on SO. If option A had an action call then it should suffice too. On 28 Apr 2015 05:30, "Eran Medan" wrote: > Hi Everyone! > > I'm trying to understand how Spark's cache work. > > Here is my naive understanding, please let me know if I'm missing > something: > > val rdd1 = sc.text
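
A tiny sketch of the point about actions (illustrative data): cache() only marks the RDD; nothing is stored until the first action runs.

    from pyspark import SparkContext

    sc = SparkContext(appName="cache-needs-action-sketch")

    rdd1 = sc.parallelize(range(1000000)).map(lambda x: x * 2)
    rdd1.cache()        # only marks the RDD as cacheable; nothing is computed yet
    rdd1.count()        # first action materialises the RDD and populates the cache
    rdd1.sum()          # later actions reuse the cached partitions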

Re: How to add jars to standalone pyspark program

2015-04-28 Thread ayan guha
It's a Windows thing. Please escape the backslashes in the path string. Basically it is not able to find the file. On 28 Apr 2015 22:09, "Fabian Böhnlein" wrote: > Can you specifiy 'running via PyCharm'. how are you executing the script, > with spark-submit? > > In PySpark I guess you used --jars databricks-csv

Re: 1.3.1: Persisting RDD in parquet - "Conflicting partition column names"

2015-04-28 Thread ayan guha
Can you show your code please? On 28 Apr 2015 13:20, "sranga" wrote: > Hi > > I am getting the following error when persisting an RDD in parquet format > to > an S3 location. This is code that was working in the 1.2 version. The > version that it is failing to work is 1.3.1. > Any help is appreci

Re: Initial tasks in job take time

2015-04-28 Thread ayan guha
Is your driver running on the same machine as the master? On 29 Apr 2015 03:59, "Anshul Singhle" wrote: > Hi, > > I'm running short spark jobs on rdds cached in memory. I'm also using a > long running job context. I want to be able to complete my jobs (on the > cached rdd) in under 1 sec. > I'm getting

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread ayan guha
I guess what you mean is not streaming. If you create a streaming context at time t, you will receive data coming through from time t onwards, not before time t. It looks like you want a queue: let Kafka write to a queue, consume messages from the queue, and stop when the queue is empty. On 29 Apr 2015 14:35, "dg

Re: Dataframe filter based on another Dataframe

2015-04-29 Thread ayan guha
m to get this to > work using df2's column as a base > > Any idea ? > > Regards, > > Olivier. > -- Best Regards, Ayan Guha

Re: DataFrame filter referencing error

2015-04-29 Thread ayan guha
> at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > > Does filter work only on columns of the integer type? What is the exact > behaviour of the filter function and what is the best way to handle the > query I am trying to execute? > > Thank you, > Francesco > > -- Best Regards, Ayan Guha

Re: How to group multiple row data ?

2015-04-29 Thread ayan guha
can't wrap my head around this. >> Can >> anyone help. >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-group-multiple-row-data-tp22701.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> - >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> > -- Best Regards, Ayan Guha

Re: HOw can I merge multiple DataFrame and remove duplicated key

2015-04-29 Thread ayan guha
It's no different; you would use group by and an aggregate function to do so. On 30 Apr 2015 02:15, "Wang, Ningjun (LNG-NPV)" wrote: > I have multiple DataFrame objects each stored in a parquet file. The > DataFrame just contains 3 columns (id, value, timeStamp). I need to union > all the DataFra
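
A small PySpark sketch of the group-by-and-aggregate idea for the (id, value, timeStamp) case (illustrative data; the struct trick for "latest row per key" is my own choice here, not from the thread):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("union-dedup-sketch").getOrCreate()

    df_a = spark.createDataFrame([(1, "old", 100), (2, "x", 50)], ["id", "value", "timeStamp"])
    df_b = spark.createDataFrame([(1, "new", 200)], ["id", "value", "timeStamp"])

    merged = df_a.unionAll(df_b)       # union() in newer releases

    # max over a struct compares field-by-field, so this keeps the row with the
    # greatest timeStamp for each id.
    latest = (merged.groupBy("id")
                    .agg(F.max(F.struct("timeStamp", "value")).alias("latest"))
                    .select("id", "latest.timeStamp", "latest.value"))
    latest.show()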

Re: Compute pairwise distance

2015-04-29 Thread ayan guha
This is my first thought; please suggest any further improvement: 1. Create an RDD of your dataset. 2. Do a cross join to generate pairs. 3. Apply reduceByKey and compute the distance. You will get an RDD with key pairs and distances. Best Ayan On 30 Apr 2015 06:11, "Driesprong, Fokko" wrote: > Dear Spark
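
A small sketch of steps 1-3 above (illustrative data; here each unordered pair appears only once, so a plain map computes the distance — reduceByKey would come in if the same key pair could be produced more than once).

    from pyspark import SparkContext

    sc = SparkContext(appName="pairwise-distance-sketch")

    # Each record: (id, feature_vector).
    points = sc.parallelize([(1, [0.0, 0.0]), (2, [3.0, 4.0]), (3, [6.0, 8.0])])

    def euclidean(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # Cross join to generate pairs, keep each unordered pair once, compute the distance.
    pairs = (points.cartesian(points)
                   .filter(lambda p: p[0][0] < p[1][0])
                   .map(lambda p: ((p[0][0], p[1][0]), euclidean(p[0][1], p[1][1]))))
    print(pairs.collect())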
