Re: Json Parsing.

2017-12-06 Thread ayan guha
On Thu, 7 Dec 2017 at 11:37 am, ayan guha <guha.a...@gmail.com> wrote: > You can use get_json function > > On Thu, 7 Dec 2017 at 10:39 am, satyajit vegesna < > satyajit.apas...@gmail.com> wrote: > >> Does spark support automatic detection of schema from a json str
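The function being suggested is presumably get_json_object (or from_json for a typed parse). A minimal PySpark sketch of both, with a made-up json_str column:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import get_json_object, from_json, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('{"name": "a", "age": 1}',)], ["json_str"])

# Pull individual fields out of the JSON string with a JSON path
df.select(get_json_object(col("json_str"), "$.name").alias("name")).show()

# Or parse the whole column against a schema (DDL-string form)
parsed = df.select(from_json(col("json_str"), "name STRING, age INT").alias("j"))
parsed.select("j.*").show()
```

For files rather than a string column, spark.read.json infers the schema automatically by scanning the data.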

Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread ayan guha
Yes, my thought exactly. Kindly let me know if you need any help to port in pyspark. On Mon, Nov 6, 2017 at 8:54 AM, Nicolas Paris <nipari...@gmail.com> wrote: > Le 05 nov. 2017 à 22:46, ayan guha écrivait : > > Thank you for the clarification. That was my understanding too

Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread ayan guha
n column is provided? > > No, in this case, each worker send a jdbc call accordingly to > documentation > https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases > > -- Best Regards, Ayan Guha
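A hedged sketch of the partitioned JDBC read being described, where each executor issues its own range query once partitionColumn/lowerBound/upperBound/numPartitions are supplied; the connection details and table name below are invented:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")   # hypothetical database
      .option("dbtable", "public.orders")
      .option("user", "some_user")
      .option("password", "some_pwd")
      .option("partitionColumn", "order_id")   # numeric column to split on
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")            # 8 parallel JDBC range queries, one per partition
      .load())
```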

Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread ayan guha
--- > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Regarding column partitioning IDs and names as per hierarchical level SparkSQL

2017-11-03 Thread ayan guha
ight_id, " > + "case when length(a.id)/2 = '8' then a.desc else ' ' end as > level_eight_name, " > + "case when length(a.id)/2 = '9' then a.id > else ' ' end as level_nine_id, " > + "case when length(a.id)/2 = '9' then a.desc else ' ' end as > level_nine_name, " > + "case when length(a.id)/2 = '10' then a.id else ' ' end as > level_ten_id, " > + "case when length(a.id)/2 = '10' then a.desc > else ' ' end as level_ten_name " > + "from CategoryTempTable a") > > > Can someone help me in also populating all the parents levels in the > respective level ID and level name, please? > > > Thanks, > Aakash. > > > -- Best Regards, Ayan Guha

Re: Structured streaming with event hubs

2017-10-27 Thread ayan guha
, > Asmath > > Sent from my iPhone > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: partition by multiple columns/keys

2017-10-17 Thread ayan guha
sible to map rows belong to unique combination inside an iterator? > > e.g > > col1 col2 col3 > a 1 a1 > a 1 a2 > b 2 b1 > b 2 b2 > > how can I separate rows with col1 and col2 = (a,1) and (b,2)? > > regards, > Imran > > -- > I.R > -- Best Regards, Ayan Guha
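One way to read the question's example, sketched in PySpark (an illustrative sketch, not the answer from the thread): hash-repartition on the two key columns so every row of a given (col1, col2) combination lands in the same partition, then walk each partition with mapPartitions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1, "a1"), ("a", 1, "a2"), ("b", 2, "b1"), ("b", 2, "b2")],
    ["col1", "col2", "col3"])

# All rows sharing (col1, col2) end up in the same partition; a partition may
# still hold several keys that hashed to the same bucket.
repartitioned = df.repartition("col1", "col2")

def handle_partition(rows):
    # 'rows' is an iterator over all rows of this partition
    for row in rows:
        yield (row.col1, row.col2, row.col3)

print(repartitioned.rdd.mapPartitions(handle_partition).collect())
```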

Re: Database insert happening two times

2017-10-17 Thread ayan guha
daJson); >> val updatelambdaReq:InvokeRequest = new InvokeRequest(); >> updatelambdaReq.setFunctionName(updateFunctionName); >> updatelambdaReq.setPayload(updatedLambdaJson.toString()); >> System.out.println("Calling lambda to add log"); >

Re: Database insert happening two times

2017-10-17 Thread ayan guha
ds appended, time etc. Instead of one single entry in the database, > multiple entries are being made to it. Is it because of parallel execution > of code in workers? If it is so then how can I solve it so that it only > writes once. > > *Thanks!* > > *Cheers!* > > Harsh Choudhary > -- Best Regards, Ayan Guha

Re: Happy Diwali to those forum members who celebrate this great festival

2017-10-16 Thread ayan guha
will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > -- Best Regards, Ayan Guha

Re: How to flatten a row in PySpark

2017-10-12 Thread ayan guha
gt;> ABZ|ABZ|AF|3|2|730 >> ABZ|ABZ|AF|3|3|730 >> . >> . >> . >> ABZ|ABZ|AF|Y|4|730 >> ABZ|ABZ|AF||Y|5|730 >> >> Basically, I want to consider the various combinations of the 4th and 5th >> columns (where the values are delimited by commas) and accordingly generate >> the above rows from a single row. Please can you suggest me for a good way >> of acheiving this. Thanks in advance ! >> >> Regards, >> >> Debu >> > > -- Best Regards, Ayan Guha
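A possible PySpark rendering of the requested expansion, using split plus explode on the two comma-delimited columns; the column names are invented stand-ins for the pipe-delimited example:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("ABZ", "ABZ", "AF", "3,Y", "1,2,3,4,5", "730")],
                           ["orig", "dest", "carrier", "col4", "col5", "days"])

# Exploding both delimited columns yields one row per combination of their values
flattened = (df
    .withColumn("col4", explode(split(col("col4"), ",")))
    .withColumn("col5", explode(split(col("col5"), ","))))
flattened.show()
```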

Re: Reading from HDFS by increasing split size

2017-10-10 Thread ayan guha
> 13); > > Is there any other param that needs to be set as well? > > Thanks > > On Tue, Oct 10, 2017 at 4:32 AM, ayan guha <guha.a...@gmail.com> wrote: > >> I have not tested this, but you should be able to pass on any map-reduce >> like conf to underlying
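A sketch of the "pass a map-reduce conf" idea in PySpark terms. Both config keys are real Spark/Hadoop settings; the path and sizes are illustrative only.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # For Hadoop RDD inputs (sc.textFile etc.): ask FileInputFormat for larger splits
         .config("spark.hadoop.mapreduce.input.fileinputformat.split.minsize", str(256 * 1024 * 1024))
         # For DataFrame file sources (parquet/csv/json): cap bytes per partition
         .config("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))
         .getOrCreate())

rdd = spark.sparkContext.textFile("hdfs:///data/big_input")   # hypothetical path
print(rdd.getNumPartitions())
```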

Re: Reading from HDFS by increasing split size

2017-10-10 Thread ayan guha
while reading from HDFS itself > instead of using repartition() etc., > > > > Any suggestions are helpful! > > > > Thanks > > > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-10 Thread ayan guha
ribe e-mail: user-unsubscr...@spark.apache.org > > > The jdbc will load data into the driver node, this may slow down the speed, and may OOM. > > > ----- > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: [Spark SQL] Missing data in Elastisearch when writing data with elasticsearch-spark connector

2017-10-09 Thread ayan guha
ark says it finished the job and saved the data? > 2. What can we do to ensure that we write data to ES in a consistent > manner? > > > > -- > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-04 Thread ayan guha
to memory errors on very huge datasets. >> >> >> Anybody knows or can point me to relevant documentation ? >> >> - >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >> >> > -- Best Regards, Ayan Guha

Re: Multiple filters vs multiple conditions

2017-10-03 Thread ayan guha
; t.x=1 && t.y=2) > > > > Approach 2: > > .filter (t-> t.x=1) > > .filter (t-> t.y=2) > > > > Is there a difference or one is better than the other or both are same? > > > > Thanks! > > Ahmed Mahmoud > > > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Replicating a row n times

2017-09-28 Thread ayan guha
ache.spark.sql.functions._ > val result = singleRowDF > .withColumn("dummy", explode(array((1 until 100).map(lit): _*))) > .selectExpr(singleRowDF.columns: _*) > > How can I create a column from an array of values in Java and pass it to > explode function? Suggestions are helpful. > > > Thanks > Kanagha > -- Best Regards, Ayan Guha
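The quoted Scala trick, re-expressed as a PySpark sketch (not the Java answer the poster asked for): build an array of literals and explode it to replicate the single row n times.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, array, lit

spark = SparkSession.builder.getOrCreate()
single_row_df = spark.createDataFrame([("x", 42)], ["name", "value"])

n = 100
replicated = (single_row_df
    .withColumn("dummy", explode(array(*[lit(i) for i in range(n)])))  # one row per literal
    .select(single_row_df.columns))                                    # drop the helper column
print(replicated.count())   # 100 copies of the single input row
```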

Re: More instances = slower Spark job

2017-09-28 Thread ayan guha
ot use parallelism. >> > > This is not true, unless the file is small or is gzipped (gzipped files > cannot be split). > -- Best Regards, Ayan Guha

Re: Debugging Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

2017-09-26 Thread ayan guha
> > > > > I probably missed something super obvious, but can’t find it… > > > > Any help/hint is welcome! - TIA > > > > jg > > > > > > > -- Best Regards, Ayan Guha

Re: partitionBy causing OOM

2017-09-25 Thread ayan guha
, >> adding to that partitioning by date (in daily ETLs for instance) would >> probably cause a data skew (right?), but why am I getting OOMs? Isn't Spark >> supposed to spill to disk if the underlying RDD is too big to fit in memory? >> >> If I'm not using "partitionBy" with the writer (still exploding) >> everything works fine. >> >> This happens both in EMR and in local (mac) pyspark/spark shell (tried >> both in python and scala). >> >> Thanks! >> >> >> > -- Best Regards, Ayan Guha

Re: Amazon Elastic Cache + Spark Streaming

2017-09-22 Thread ayan guha
tried amazon elastic cache.Just give me some pointers. Thanks! > -- Best Regards, Ayan Guha

Re: How to pass sparkSession from driver to executor

2017-09-21 Thread ayan guha
t;>>> >>>> # >>>> >>>> val spark = SparkSession.builder().appName("SampleJob").config("spark. >>>> master", "local") .getOrCreate() >>>> >>>> val df = this is dataframe which has list of file names (hdfs) >>>> >>>> df.foreach { fileName => >>>> >>>> *spark.read.json(fileName)* >>>> >>>> .. some logic here >>>> } >>>> >>>> # >>>> >>>> >>>> *spark.read.json(fileName) --- this fails as it runs in executor. When >>>> I put it outside foreach, i.e. in driver, it works.* >>>> >>>> As I am trying to use spark (sparkSession) in executor which is not >>>> visible outside driver. But I want to read hdfs files inside foreach, how >>>> do I do it. >>>> >>>> Can someone help how to do this. >>>> >>>> Thanks, >>>> Chackra >>>> >>> >>> >> > -- Best Regards, Ayan Guha
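One common workaround, sketched in PySpark under the assumption that the list of file names is small: collect the paths to the driver and hand them to a single spark.read.json call, instead of calling the session inside foreach (which runs on executors, where the session is not available).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SampleJob").getOrCreate()

# df is assumed to hold one HDFS path per row in a column called "fileName"
df = spark.createDataFrame([("hdfs:///data/a.json",), ("hdfs:///data/b.json",)], ["fileName"])

# Bring the (small) list of paths back to the driver...
paths = [r.fileName for r in df.select("fileName").collect()]

# ...and let a single driver-side read fan the actual work out to the executors.
data = spark.read.json(paths)
data.show()
```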

Re: Nested RDD operation

2017-09-19 Thread ayan guha
;).rdd.map(_.getString(0 >> ).split(",").map(_.trim replaceAll ("[\\[\\]\"]", "")).toList) >> //val oneRow = >> Converted(eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq("CCB")).toDF("event_name")).select("eventIndex").first().getDouble(0)) >> >> val severalRows = rddX.map(row => { >> // Split array into n tools >> println("ROW: " + row(0).toString) >> println(row(0).getClass) >> println("PRINT: " + >> eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row(0 >> ))).toDF("event_name")).select("eventIndex").first().getDouble(0)) >> (eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq >> (row)).toDF("event_name")).select("eventIndex").first().getDouble(0), Seq >> (row).toString) >> }) >> // attempting >> >> >> -- Best Regards, Ayan Guha

Re: How to convert Row to JSON in Java?

2017-09-10 Thread ayan guha
> wrote: >> >>> toJSON on Dataset/DataFrame? >>> >>> -- >>> *From:* kant kodali <kanth...@gmail.com> >>> *Sent:* Saturday, September 9, 2017 4:15:49 PM >>> *To:* user @spark >>> *Subject:* How to convert Row to JSON in Java? >>> >>> Hi All, >>> >>> How to convert Row to JSON in Java? It would be nice to have .toJson() >>> method in the Row class. >>> >>> Thanks, >>> kant >>> >> >> > -- Best Regards, Ayan Guha
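For reference, the toJSON suggestion in PySpark form (the thread itself is about the Java API); a tiny illustrative sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "kant")], ["id", "name"])

# toJSON turns each Row into a JSON string
json_rdd = df.toJSON()
print(json_rdd.first())   # {"id":1,"name":"kant"}
```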

Re: Python vs. Scala

2017-09-05 Thread ayan guha
, LLC > 913.938.6685 > > www.massstreet.net > > www.linkedin.com/in/bobwakefieldmba > Twitter: @BobLovesData <http://twitter.com/BobLovesData> > > > -- Best Regards, Ayan Guha

Re: update hive metastore in spark session at runtime

2017-09-01 Thread ayan guha
ation, > > usecase: i want to read data from hive one cluster and write to hive on > another cluster > > Please suggest if this can be done? > > > -- Best Regards, Ayan Guha

Re: Checkpointing With NOT Serializable

2017-08-31 Thread ayan guha
Any help on this? On Thu, Aug 31, 2017 at 10:30 AM, ayan guha <guha.a...@gmail.com> wrote: > Hi > > Want to understand a basic issue. Here is my code: > > def createStreamingContext(sparkCheckpointDir: String,batchDuration: Int > ) = { > > val ssc = new Strea
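A minimal PySpark sketch of the StreamingContext.getOrCreate factory pattern being discussed; the host/port and checkpoint path are made up. The point of the pattern is that the DStream graph, and anything non-serializable, is built inside the factory so it can be recreated on recovery.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

checkpoint_dir = "hdfs:///tmp/checkpoints/my-app"   # hypothetical path
batch_duration = 10

def create_streaming_context():
    sc = SparkContext.getOrCreate()
    ssc = StreamingContext(sc, batch_duration)
    ssc.checkpoint(checkpoint_dir)
    # Define the DStream graph inside the factory
    lines = ssc.socketTextStream("localhost", 9999)
    lines.count().pprint()
    return ssc

# On a cold start the factory runs; on restart the context is rebuilt from the checkpoint.
ssc = StreamingContext.getOrCreate(checkpoint_dir, create_streaming_context)
# ssc.start(); ssc.awaitTermination()
```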

Re: Select entire row based on a logic applied on 2 columns across multiple rows

2017-08-30 Thread ayan guha
;> ..that's the reason I'm planning for mapgroups with function as argument >>> which takes rowiterator ..but not sure if this is the best to implement as >>> my initial dataframe is very large >>> >>> On Tue, Aug 29, 2017 at 10:24 PM ayan guha <guha.a...@gm

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-29 Thread ayan guha
29, 2017 at 7:50 PM, Burak Yavuz <brk...@gmail.com> wrote: > >> That just gives you the max time for each train. If I understood the >> question correctly, OP wants the whole row with the max time. That's >> generally solved through joins or subqueries, which would be har

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-29 Thread ayan guha
>> wrote: >>>>>> Hi All, I am wondering what is the easiest and most concise way to express the computation below in Spark Structured Streaming, given that it supports both imperative and declarative styles? I am just trying to select the rows that have the max timestamp for each train. Instead of doing some sort of nested queries like we normally do in any relational database, I am trying to see if I can leverage both imperative and declarative at the same time. If nested queries or joins are not required then I would like to see how this can be possible. I am using Spark 2.1.1.
>>>>>>
>>>>>> Dataset:
>>>>>> Train  Dest  Time
>>>>>> 1      HK    10:00
>>>>>> 1      SH    12:00
>>>>>> 1      SZ    14:00
>>>>>> 2      HK    13:00
>>>>>> 2      SH    09:00
>>>>>> 2      SZ    07:00
>>>>>>
>>>>>> The desired result should be:
>>>>>> Train  Dest  Time
>>>>>> 1      SZ    14:00
>>>>>> 2      HK    13:00
> -- Best Regards, Ayan Guha
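One batch-style way to get "the whole row with the max time", matching the join/subquery point made above: aggregate the max per key and join it back. This is an illustrative sketch; a real Structured Streaming job would additionally need watermarks and a supported output mode.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "HK", "10:00"), (1, "SH", "12:00"), (1, "SZ", "14:00"),
     (2, "HK", "13:00"), (2, "SH", "09:00"), (2, "SZ", "07:00")],
    ["train", "dest", "time"])

# Latest row per key as a join: max(time) per train, joined back to keep the full row
latest = df.groupBy("train").agg(F.max("time").alias("time"))
result = df.join(latest, ["train", "time"]).select("train", "dest", "time")
result.show()   # 1 SZ 14:00 / 2 HK 13:00
```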

Re: Select entire row based on a logic applied on 2 columns across multiple rows

2017-08-29 Thread ayan guha
*val *empList = empData.toList >> >> >> *//calculate income has Logic to figureout latest income for an account >> and returns latest income val *income = calculateIncome(empList) >> >> >> *for *(i <- empList) { >> >> *val *row = i >> >> *return new *employee(row.EmployeeID, row.INCOMEAGE , income) >> } >> *return "Done"* >> >> >> >> } >> >> >> >> Is this a better approach or even the right approach to implement the >> same.If not please suggest a better way to implement the same? >> >> >> >> -- >> >> The information contained in this e-mail is confidential and/or >> proprietary to Capital One and/or its affiliates and may only be used >> solely in performance of work or services for Capital One. The information >> transmitted herewith is intended only for use by the individual or entity >> to which it is addressed. If the reader of this message is not the intended >> recipient, you are hereby notified that any review, retransmission, >> dissemination, distribution, copying or other use of, or taking of any >> action in reliance upon this information is strictly prohibited. If you >> have received this communication in error, please contact the sender and >> delete the material from your computer. >> > -- Best Regards, Ayan Guha

Re: Livy with Spark package

2017-08-23 Thread ayan guha
his is equal to --package in > spark-submit. > > BTW you'd better ask livy question in u...@livy.incubator.apache.org. > > Thanks > Jerry > > On Thu, Aug 24, 2017 at 8:11 AM, ayan guha <guha.a...@gmail.com> wrote: > >> Hi >> >> I have a python p

Livy with Spark package

2017-08-23 Thread ayan guha
t not able to understand how/where to configure the "packages" switch...Any help? -- Best Regards, Ayan Guha

Re: Update MySQL table via Spark/SparkR?

2017-08-21 Thread ayan guha
ing table in MySQL >> <https://stackoverflow.com/questions/34643200/spark-dataframes-upsert-to-postgres-table> >> and >> then perform the UPDATE query on the MySQL side. >> >> >> >> Ideally, I’d like to handle the update during the write operation. Has >> anyone else encountered this limitation and have a better solution? >> >> >> >> Thank you, >> >> >> >> Jake >> > > -- Best Regards, Ayan Guha

Re: Spark hive overwrite is very very slow

2017-08-20 Thread ayan guha
gt;>> >>> Hi, >>>> >>> >>>> >>> I have written spark sql job on spark2.0 by using scala . It is >>>> just pulling the data from hive table and add extra columns , remove >>>> duplicates and then write it back to hive again. >>>> >>> >>>> >>> In spark ui, it is taking almost 40 minutes to write 400 go of >>>> data. Is there anything that I need to improve performance . >>>> >>> >>>> >>> Spark.sql.partitions is 2000 in my case with executor memory of >>>> 16gb and dynamic allocation enabled. >>>> >>> >>>> >>> I am doing insert overwrite on partition by >>>> >>> Da.write.mode(overwrite).insertinto(table) >>>> >>> >>>> >>> Any suggestions please ?? >>>> >>> >>>> >>> Sent from my iPhone >>>> >>> >>>> - >>>> >>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >>>> >>> >>>> >>> >>> >> > -- Best Regards, Ayan Guha

Re: How to authenticate to ADLS from within spark job on the fly

2017-08-18 Thread ayan guha
user to be able to read that path using the > credentials fetched above. > > Any help is appreciated. > > Thanks, > Imtiaz > -- Best Regards, Ayan Guha

Re: Re: Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread ayan guha
00.00|1483286400. > My idea is like below: > 1.write a udf > to add the userid to the beginning of every string element > of listForRule77. > 2.use > val rdd2 = rdd1.map{x=> List_udf(x))}.flatmap() > , the result rdd2 maybe what I need. > > My question: Are there any problems in my idea? Is there a better > way to do this ? > > > > > > ThanksBest regards! > San.Luo > > > > -- Best Regards, Ayan Guha

Re: Runnig multiple spark jobs on yarn

2017-08-02 Thread ayan guha
ent on 3 > node spark cluster? > > Android için Outlook <https://aka.ms/ghei36> uygulamasını edinin > > -- Best Regards, Ayan Guha

Re: Logging in RDD mapToPair of Java Spark application

2017-07-30 Thread ayan guha
st. > Log aggregation has not completed or is not enabled. > > Any other way to see my logs? > > Thanks > > John > > > > > -- > *From:* ayan guha <guha.a...@gmail.com> > *Sent:* Sunday, July 30, 2017 10:34 PM > *To:* John Zen

Re: Logging in RDD mapToPair of Java Spark application

2017-07-30 Thread ayan guha
;> I only see the first line which is outside of the 'mapToPair'. I actually >> have verified my 'mapToPair' is called and the statements after the second >> logging line were executed. The only issue for me is why the second >> logging >> is not in JobTracker UI. >> >> Appreciate your help. >> >> Thanks >> >> John >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/Logging-in-RDD-mapToPair-of-Java-Spark-application-tp29007.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> - >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >> >> > -- Best Regards, Ayan Guha

OrderedDict to DF

2017-07-30 Thread ayan guha
: long (nullable = true) |-- START_DATE: string (nullable = true) Is there any way to do it? I have control over to use OrderedDict vs normal dict, but the column order is the requirement. Any help would be great!! -- Best Regards, Ayan Guha
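A sketch of one way to pin the column order when building the DataFrame, assuming the rows arrive as OrderedDicts: take the key order from the first record and pass it as the column-name list, so Spark does not re-sort the fields.

```python
from collections import OrderedDict
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rows = [OrderedDict([("ID", 1), ("START_DATE", "2017-07-01")]),
        OrderedDict([("ID", 2), ("START_DATE", "2017-07-02")])]

# Building from dicts tends to sort field names alphabetically, so pin the order explicitly:
wanted_order = list(rows[0].keys())
df = spark.createDataFrame([tuple(r.values()) for r in rows], wanted_order)
df.printSchema()   # ID first, then START_DATE
```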

Re: Support Dynamic Partition Inserts params with SET command in Spark 2.0.1

2017-07-28 Thread ayan guha
rkSession.builder().enableHiveSupport().getOrCreate() >> >> *spark.sqlContext.setConf("hive.exec.max.dynamic.partitions", "2000")* >> >> Please help with alternate workaround ! >> >> Thanks >> >> > -- Best Regards, Ayan Guha

Re: Create static Map Type column

2017-07-26 Thread ayan guha
d1 -> fie...| | 3|Map(field1 -> fie...| | 4|Map(field1 -> fie...| +---++ Thanks for creating such useful functions, spark devs :) Best Ayan On Thu, Jul 27, 2017 at 2:30 PM, ayan guha <guha.a...@gmail.com> wrote: > Hi > > I want to create a s
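For reference, the output pasted above looks like what create_map with literal keys and values produces; a small hedged sketch:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import create_map, lit

spark = SparkSession.builder.getOrCreate()
df = spark.range(1, 5)   # ids 1..4, as in the pasted output

# create_map takes alternating key/value columns; lit() makes the map static
df_with_map = df.withColumn(
    "m", create_map(lit("field1"), lit("fieldvalue1"), lit("field2"), lit("fieldvalue2")))
df_with_map.show(truncate=False)
```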

Create static Map Type column

2017-07-26 Thread ayan guha
n is). Is there any other simpler way to achieve this? Probably using withColumn API? I saw a similar post here <https://stackoverflow.com/questions/44223751/how-to-add-empty-map-type-column-to-dataframe>but not able to use the trick Jacek suggested. -- Best Regards, Ayan Guha

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-26 Thread ayan guha
.ForeachWriter> >>> is >>> supported in Scala. BUT this doesn't seem like an efficient approach and >>> adds deployment overhead because now I will have to support Scala in my app. >>> >>> Another approach is obviously to use Scala instead of python, which is >>> fine but I want to make sure that I absolutely cannot use python for this >>> problem before I take this path. >>> >>> Would appreciate some feedback and alternative design approaches for >>> this problem. >>> >>> Thanks. >>> >>> >>> >>> >> > -- Best Regards, Ayan Guha

Re: Querying Drill with Spark DataFrame

2017-07-22 Thread ayan guha
try" FROM > (SELECT * FROM dfs.root.`output.parquet`) AS Customers ) LIMIT 0 > > Now, the Encountered quote is at "CustomerID" in the query. > > I tried to run the following query in Drill shell: > > SELECT "CustomerID" from dfs.root.`output.parquet`; > > It gives the same error of 'Encountered "\"" '. > > I want to ask if there is any way to remove the above "SELECT > "CustomerID","First_name","Last_name","Email","Gender","Country" FROM" from > the above query formulated by Spark and pushed down to Apache Drill via > JDBC driver. > > Or any other way around like removing the Quotes? > > > Thanks, > > Luqman > -- Best Regards, Ayan Guha

Re: Spark Data Frame Writer - Range Partiotioning

2017-07-21 Thread ayan guha
ike that? Then also have query > engine respect it > > Thanks, > > Nishit > -- Best Regards, Ayan Guha

Re: Spark on Cloudera Configuration (Scheduler Mode = FAIR)

2017-07-20 Thread ayan guha
icallocation=true but it got only one >>>> executor. >>>> >>>> Normally this situation occurs when any of the JOB runs with the >>>> Scheduler.mode= FIFO. >>>> >>>> 1) Have your ever faced this issue if so how to overcome this?. >>>> >>>> I was in the impression that as soon as I submit the JOB B the Spark >>>> Scheduler should distribute/release few resources from the JOB A and share >>>> it with the JOB A in the Round Robin fashion?. >>>> >>>> Appreciate your response !!!. >>>> >>>> >>>> Thanks & Regards, >>>> Gokula Krishnan* (Gokul)* >>>> >>> >>> >> > -- Best Regards, Ayan Guha

Re: Failed to find Spark jars directory

2017-07-20 Thread ayan guha
2017 at 7:42 PM, ayan guha <guha.a...@gmail.com> wrote: > >> You should download a pre built version. The code you have got is source >> code, you need to build it to generate the jar files. >> >> > Hi Ayan, > > Can you please help me understand to build

Re: Failed to find Spark jars directory

2017-07-20 Thread ayan guha
o https://spark.apache.org/downloads.html. > > Any help will be highly appreciable. Thanks in Advance. > > Regards, > > Kaushal > -- Best Regards, Ayan Guha

Re: Spark 2.0 and Oracle 12.1 error

2017-07-20 Thread ayan guha
;) > .master("local[4]") > .getOrCreate(); > > final Properties connectionProperties = new Properties(); > connectionProperties.put("user", *"some_user"*)); > connectionProperties.put("password", "some_pwd")); > >

Re: DataFrame --- join / groupBy-agg question...

2017-07-19 Thread ayan guha
rame --- join / groupBy-agg > question... > <http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-join-groupBy-agg-question-tp28849p28880.html> > > Sent from the Apache Spark User List mailing list archive > <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com. > -- Best Regards, Ayan Guha

Azure key vault

2017-07-18 Thread ayan guha
Hi Anyone here any exp in integrating spark with azure keyvault? -- Best Regards, Ayan Guha

Re: running spark job with fat jar file

2017-07-17 Thread ayan guha
may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > On 17 July 2017 at 19:34, ayan guha <guha.a...@gmail.com> wrote: > >

Re: splitting columns into new columns

2017-07-17 Thread ayan guha
schema.substring(0,schema.length-1) > val sqlSchema = StructType(schema.split(",").map(s=>StructField(s, > StringType,false))) > sqlContext.createDataFrame(newDataSet,sqlSchema).show() > > Regards > Pralabh Kumar > > > On Mon, Jul 17, 2017 at 1:55 PM, n

Re: running spark job with fat jar file

2017-07-17 Thread ayan guha
to > exist > >> > in > >> > the same directory on each executor node? > >> > > >> > cheers, > >> > > >> > > >> > Dr Mich Talebzadeh

Re: running spark job with fat jar file

2017-07-17 Thread ayan guha
t; from relying on this email's technical content is explicitly disclaimed. > The > > author will in no case be liable for any monetary damages arising from > such > > loss, damage or destruction. > > > > > > > > -- > Marcelo > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: splitting columns into new columns

2017-07-16 Thread ayan guha
quot;,"\\^")).select($"phon‌​e".getItem(0).as("ph‌​one1"),$"phone".getI‌​tem(1).as("phone2”)) > I though of doing this way but the problem is column are having 100+ > separator between the column values > > > > Thank you, > Nayan > -- Best Regards, Ayan Guha

Re: Iterate over grouped df to create new rows/df

2017-07-10 Thread ayan guha
gain for your time :) > > On Sat, Jul 8, 2017 at 8:06 PM, ayan guha <guha.a...@gmail.com> wrote: > >> Hi >> >> Mostly from SO to find overlapping time, adapted to Spark >> >> >> %pyspark >> from pyspark.sql import Row >> d = [Row(t="2017-0

Re: If I pass raw SQL string to dataframe do I still get the Spark SQL optimizations?

2017-07-06 Thread ayan guha
; > On Thu, Jul 6, 2017 at 5:28 PM, kant kodali <kanth...@gmail.com> wrote: > >> HI All, >> >> I am wondering If I pass a raw SQL string to dataframe do I still get the >> Spark SQL optimizations? why or why not? >> >> Thanks! >> > > -- Best Regards, Ayan Guha
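Both a raw SQL string and the DataFrame API go through the same Catalyst analyzer and optimizer, which can be checked by comparing the two explain outputs; a tiny sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100)
df.createOrReplaceTempView("t")

# The two physical plans should be essentially identical
spark.sql("SELECT id FROM t WHERE id > 10").explain()
df.filter("id > 10").select("id").explain()
```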

Re: json in Cassandra to RDDs

2017-07-01 Thread ayan guha
le = sc.cassandraTable("test", "ttable") > println(ttable.count) > > Some help please to join both things. Scala or Python code for me it's ok. > Thanks in advance. > Cheers. > > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: PySpark working with Generators

2017-06-29 Thread ayan guha
, Saatvik Shah <saatvikshah1...@gmail.com> wrote: > Hey Ayan, > > This isnt a typical text file - Its a proprietary data format for which a > native Spark reader is not available. > > Thanks and Regards, > Saatvik Shah > > On Thu, Jun 29, 2017 at 6:48 PM, ayan

Re: PySpark working with Generators

2017-06-29 Thread ayan guha
sage in context: > http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-working-with-Generators-tp28810.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread ayan guha
rk.xml").option( > "rowTag","book").load("books.xml") > > val xData = dfX.registerTempTable("books") > > dfX.printSchema() > > val books_inexp =sqlContext.sql("select title,author from books where > price<10") > > books_inexp.show > > > > Regards, > > Amol > This message contains information that may be privileged or confidential > and is the property of the Capgemini Group. It is intended only for the > person to whom it is addressed. If you are not the intended recipient, you > are not authorized to read, print, retain, copy, disseminate, distribute, > or use this message or any part thereof. If you receive this message in > error, please notify the sender immediately and delete all copies of this > message. > -- Best Regards, Ayan Guha

Re: Building Kafka 0.10 Source for Structured Streaming Error.

2017-06-28 Thread ayan guha
execution.datasources.DataSource.sourceSchema(DataSource.scala:192) >>> >>> at >>> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87) >>> >>> at >>> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87) >>> >>> at >>> org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30) >>> >>> at >>> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:150) >>> >>> ... 48 elided >>> >>> Caused by: java.lang.ClassNotFoundException: >>> org.apache.kafka.common.serialization.ByteArrayDeserializer >>> >>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381) >>> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424) >>> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357) >>> >>> ... 57 more >>> >>> >>> ++ >>> >>> i have tried building the jar with dependencies, but still face the same >>> error. >>> >>> But when i try to do --package with spark-shell using bin/spark-shell >>> --package org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 , it works >>> fine. >>> >>> The reason, i am trying to build something from source code, is because >>> i want to try pushing dataframe data into kafka topic, based on the url >>> https://github.com/apache/spark/commit/b0a5cd89097c563e9949d8cfcf84d18b03b8d24c, >>> which doesn't work with version 2.1.0. >>> >>> >>> Any help would be highly appreciated. >>> >>> >>> Regards, >>> >>> Satyajit. >>> >>> >>> >> > -- Best Regards, Ayan Guha

Re: HDP 2.5 - Python - Spark-On-Hbase

2017-06-28 Thread ayan guha
HBase writer? Or any other way to write continuous data to HBase? best Ayan On Tue, Jun 27, 2017 at 2:15 AM, Weiqing Yang <yangweiqing...@gmail.com> wrote: > For SHC documentation, please refer the README in SHC github, which is > kept up-to-date. > > On Mon, Jun 26, 2017 at

Re: IDE for python

2017-06-27 Thread ayan guha
wrote: > Hi, > I recently switched from scala to python, and wondered which IDE people > are using for python. I heard about pycharm, spyder etc. How do they > compare with each other? > > Thanks, > Shawn > -- Best Regards, Ayan Guha

Re: HDP 2.5 - Python - Spark-On-Hbase

2017-06-26 Thread ayan guha
ability to configure closure serializer >- HTTPBroadcast >- TTL-based metadata cleaning >- *Semi-private class org.apache.spark.Logging. We suggest you use >slf4j directly.* >- SparkContext.metricsSystem > > Thanks, > > Mahesh

Re: Spark streaming persist to hdfs question

2017-06-25 Thread ayan guha
f the spark streaming application? > > > Any help would be appreciated. > > Thanks, > Naveen > > > -- Best Regards, Ayan Guha

Re: HDP 2.5 - Python - Spark-On-Hbase

2017-06-25 Thread ayan guha
wrote: > Yes. > What SHC version you were using? > If hitting any issues, you can post them in SHC github issues. There are > some threads about this. > > On Fri, Jun 23, 2017 at 5:46 AM, ayan guha <guha.a...@gmail.com> wrote: > >> Hi >> >> Is it possible t

HDP 2.5 - Python - Spark-On-Hbase

2017-06-23 Thread ayan guha
st Regards, Ayan Guha

Re: Read Data From NFS

2017-06-13 Thread ayan guha
-of-parallelism and > its spark.default.parallelism setting > > Best, > > On Mon, Jun 12, 2017 at 6:18 AM, ayan guha <guha.a...@gmail.com> wrote: > >> I understand how it works with hdfs. My question is when hdfs is not the >> file sustem, how number of parti

Re: Use SQL Script to Write Spark SQL Jobs

2017-06-13 Thread ayan guha
name). >>> >>> In short, it is a high level SQL-like DSL (Domain Specific Language) on >>> top of Spark. People can use that DSL to write Spark jobs without worrying >>> about Spark internal details. Please check README >>> <https://github.com/uber/uberscriptquery> in the project to get more >>> details. >>> >>> It will be great if I could get any feedback or suggestions! >>> >>> Best, >>> Bo >>> >>> > > -- Best Regards, Ayan Guha

Re: Parquet file generated by Spark, but not compatible read by Hive

2017-06-12 Thread ayan guha
he schema in Hive is different as in parquet file. >> But this is a very strange case, as the same schema works fine for other >> brands, which defined as a partition column, and share the whole Hive >> schema as the above. >> >> If I query like: "select * from tablename where brand='*BrandB*' limit >> 3:", everything works fine. >> >> So is this really caused by the Hive schema mismatch with parquet file >> generated by Spark, or by the data within different partitioned keys, or >> really a compatible issue between Spark/Hive? >> >> Thanks >> >> Yong >> >> >> -- Best Regards, Ayan Guha

Re: Read Data From NFS

2017-06-11 Thread ayan guha
s/29011574/how-does-partitioning-work-for-data-from-files-on-hdfs > > > > Regards, > Vaquar khan > > > On Jun 11, 2017 5:28 AM, "ayan guha" <guha.a...@gmail.com> wrote: > >> Hi >> >> My question is what happens if I have 1 file of say 100g

Re: Read Data From NFS

2017-06-11 Thread ayan guha
> code then you will get 12 partition. > > r = sc.textFile("file://my/file/*") > > Not sure what you want to know about file system ,please check API doc. > > > Regards, > Vaquar khan > > > On Jun 8, 2017 10:44 AM, "ayan guha" <guha.a...@g

Re: Read Data From NFS

2017-06-08 Thread ayan guha
Any one? On Thu, 8 Jun 2017 at 3:26 pm, ayan guha <guha.a...@gmail.com> wrote: > Hi Guys > > Quick one: How spark deals (ie create partitions) with large files sitting > on NFS, assuming the all executors can see the file exactly same way. > > ie, when I run > > r

Re: [Spark Core] Does spark support read from remote Hive server via JDBC

2017-06-08 Thread ayan guha
t;>> .option("driver", "org.apache.hive.jdbc.HiveDriver") >>>> .format("jdbc") >>>> .load() >>>> test.show() >>>> >>>> >>>> Scala version: 2.11 >>>> Spark version: 2.1.0, i also tried 2.1.1 >>>> Hive version: CDH 5.7 Hive 1.1.1 >>>> Hive JDBC version: 1.1.1 >>>> >>>> But this problem available on Hive with later versions, too. >>>> I didn't find anything in mail group answers and StackOverflow. >>>> Could you, please, help me with this issue or could you help me find >>>> correct >>>> solution how to query remote hive from spark? >>>> >>>> Thanks in advance! >>>> >>> >>> >> -- Best Regards, Ayan Guha

Read Data From NFS

2017-06-07 Thread ayan guha
File("hdfs://my/file") Are the input formats used same in both cases? -- Best Regards, Ayan Guha

Re: Edge Node in Spark

2017-06-06 Thread ayan guha
erent >> from Edge node for Hadoop please? >> >> Thanks >> >> -- -- - >> To unsubscribe e-mail: user-unsubscribe@spark.apache. org >> <user-unsubscr...@spark.apache.org> >> >> >> >> >> > -- Best Regards, Ayan Guha

Re: Dynamically working out upperbound in JDBC connection to Oracle DB

2017-05-29 Thread ayan guha
(NumberFormatException.java:65) > at java.lang.Long.parseLong(Long.java:589) > at java.lang.Long.parseLong(Long.java:631) > at > scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276) > at scala.collection.immutable.StringOps.toLong(StringOps.scala:29) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:42) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125) > ... 56 elided > > > Any ideas how this can work! > > Thanks > > > > > > > > > > -- Best Regards, Ayan Guha
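One way around the NumberFormatException shown above: fetch the bounds from the database first and pass them to the partitioned read as plain integers. A hedged sketch with invented Oracle connection details:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"   # hypothetical connection details
props = {"user": "scott", "password": "tiger", "driver": "oracle.jdbc.OracleDriver"}

# 1. Ask the database for the bounds of the partition column (pushed down as an inline view)
bounds = spark.read.jdbc(url, "(SELECT MIN(id) AS lo, MAX(id) AS hi FROM big_table) b",
                         properties=props).first()
lo, hi = int(bounds[0]), int(bounds[1])

# 2. Use them as lowerBound/upperBound for the partitioned read (they must be plain integers)
df = spark.read.jdbc(url, "big_table", column="id", lowerBound=lo, upperBound=hi,
                     numPartitions=8, properties=props)
```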

Re: Using SparkContext in Executors

2017-05-28 Thread ayan guha
numOfSlices) >>>> dumpFilesRDD.foreachPartition(dumpFilePath->parse(dumpFilePath)) >>>> . >>>> . >>>> . >>>> >>>> In parse(), each dump file is parsed and inserted into database using >>>> SparlSQL. In order to do that, SparkContext is needed in the function parse >>>> to use the sql() method. >>>> >>> >>> > -- Best Regards, Ayan Guha

Re: Documentation on "Automatic file coalescing for native data sources"?

2017-05-19 Thread ayan guha
uot;native data source"? I couldn't find any mention of this feature in >> the SQL Programming Guide and Google was not helpful. >> >> -- >> Daniel Siegmann >> Senior Software Engineer >> *SecurityScorecard Inc.* >> 214 W 29th Street, 5th Floor >> New York, NY 10001 >> >> -- Best Regards, Ayan Guha

Re: spark cluster performance decreases by adding more nodes

2017-05-17 Thread ayan guha
e ask. > Any cues on why this is happening would be very helpful, been stuck on > this for two days now. Thank you for your time. > > > ***versions*** > > Zeppelin: 0.7.1 > Spark: 2.1.0 > Cassandra: 2.2.9 > Connector: datastax:spark-cassandra-connector:2.0.1-s_2.11 > > *Spark cluster specs* > > 6 vCPUs, 32 GB memory = 1 node > > *Cassandra + Zeppelin server specs* > 8 vCPUs, 52 GB memory > > -- Best Regards, Ayan Guha

Re: How can i merge multiple rows to one row in sparksql or hivesql?

2017-05-15 Thread ayan guha
er_id1 feature100 > > I want to get the result as follow > user_id1 feature1 feature2 feature3 feature4 feature5...feature100 > > Is there a more efficient way except join? > > Thanks! > > Didac Gil de la Iglesia > PhD in Computer Science > didacg...@gmail.com > Spain: +34 696 285 544 > Sweden: +46 (0)730229737 > Skype: didac.gil.de.la.iglesia > > -- Best Regards, Ayan Guha
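If the source data is in a long user_id/feature/value layout, groupBy plus pivot is the usual alternative to a wide self-join; a sketch with made-up values:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("user_id1", "feature1", 0.1), ("user_id1", "feature2", 0.3), ("user_id2", "feature1", 0.7)],
    ["user_id", "feature", "value"])   # assumed long/narrow layout

# One wide row per user, one column per feature value
wide = df.groupBy("user_id").pivot("feature").agg(F.first("value"))
wide.show()
```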

Re: Spark Shell issue on HDInsight

2017-05-14 Thread ayan guha
spark/issues. HTH! > > On Thu, May 11, 2017 at 9:08 PM ayan guha <guha.a...@gmail.com> wrote: > >> Works for me tooyou are a life-saver :) >> >> But the question: should/how we report this to Azure team? >> >> On Fri, May 12, 2017 at 10:32 AM, Denny L

Re: Spark Shell issue on HDInsight

2017-05-11 Thread ayan guha
.__/\_,_/_/ /_/\_\ version 2.0.2.2.5.4.0-121 > /_/ > > Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_121) > Type in expressions to have them evaluated. > Type :help for more information. > > scala> > > HTH! > > > On Wed, May 10,

Re: Spark Shell issue on HDInsight

2017-05-11 Thread ayan guha
tDB connector, > specifically version 0.0.1. Could you run the 0.0.3 version of the jar and > see if you're still getting the same error? i.e. > > spark-shell --master yarn --jars azure-documentdb-spark-0.0.3- > SNAPSHOT.jar,azure-documentdb-1.10.0.jar > > > On Mon, May 8,

Spark Shell issue on HDInsight

2017-05-08 Thread ayan guha
d:~/azure-spark-docdb-test/v1$ uname -a Linux ed0-svochd 4.4.0-72-generic #93-Ubuntu SMP Fri Mar 31 14:07:41 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux sshuser@ed0-svochd:~/azure-spark-docdb-test/v1$ -- Best Regards, Ayan Guha

Re: Spark books

2017-05-03 Thread ayan guha
<zemin...@gmail.com> wrote: > >> I'm trying to decide whether to buy the book learning spark, spark for >> machine learning etc. or wait for a new edition covering the new concepts >> like dataframe and datasets. Anyone got any suggestions? >> > > -- Best Regards, Ayan Guha

Re: Azure Event Hub with Pyspark

2017-04-21 Thread ayan guha
/Azure/azure-documentdb-spark) if sticking > with Scala? > On Thu, Apr 20, 2017 at 21:46 Nan Zhu <zhunanmcg...@gmail.com> wrote: > >> DocDB does have a java client? Anything prevent you using that? >> >> Get Outlook for iOS <https://aka.ms/o0ukef> >> --

Re: Azure Event Hub with Pyspark

2017-04-20 Thread ayan guha
hdinsight/spark-eventhubs : which is > eventhub receiver for spark streaming > We are using it but you have scala version only i guess > > > Thanks, > Ashish Singh > > On Fri, Apr 21, 2017 at 9:19 AM, ayan guha <guha.a...@gmail.com> wrote: > >> [image: Boxbe] <

Azure Event Hub with Pyspark

2017-04-20 Thread ayan guha
Hi I am not able to find any conector to be used to connect spark streaming with Azure Event Hub, using pyspark. Does anyone know if there is such library/package exists>? -- Best Regards, Ayan Guha

Re: isin query

2017-04-17 Thread ayan guha
st("m_123","m_111","m_145"):_*)) > count =0 > but > > df.filter($"msrid" isin (List("m_123"):_*)) > count=121212 > > Any suggestion will do a great help to me. > > Thanks, > Nayan > -- Best Regards, Ayan Guha

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread ayan guha
; .filter(checkID($"ID")) > > .select($"ID", $"BINARY") > > .write... > > > > Thanks for any advice! > > > > > > > > > > -- > > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/How-to-store-10M-records-in-HDFS-to-speed-up-further-filtering-tp28605.html > > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > > > - > > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > > -- Best Regards, Ayan Guha

Re: Shall I use Apache Zeppelin for data analytics & visualization?

2017-04-17 Thread ayan guha
; >>> > >>> > -- >>> > View this message in context: >>> http://apache-spark-user-list.1001560.n3.nabble.com/Shall-I-use-Apache-Zeppelin-for-data-analytics-visualization-tp28604.html >>> > Sent from the Apache Spark User List mailing list archive at >>> Nabble.com. >>> > >>> > - >>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org >>> > >>> >> >> > -- Best Regards, Ayan Guha

Re: Invalidating/Remove complete mapWithState state

2017-04-17 Thread ayan guha
; bitte sofort den Absender und löschen Sie diese E-Mail und evtl. > beigefügter Dateien umgehend. Das unerlaubte Kopieren, Nutzen oder Öffnen > evtl. beigefügter Dateien sowie die unbefugte Weitergabe dieser E-Mail ist > nicht gestattet > -- Best Regards, Ayan Guha

Re: Spark SQL (Pyspark) - Parallel processing of multiple datasets

2017-04-17 Thread ayan guha
>> I tried below things >>> 1. I'm creating dataframe name & temporary tabel name dynamically based >>> in dataset name. >>> 2. Enabled Spark Dynamic allocation (--conf >>> spark.dynamicAllocation.enabled=true) >>> 3. Set spark.scheduler.mode to FAIR >>> >>> >>> I appreciate advise on >>> 1. Is anything wrong in above implementation? >>> 2. Is it good idea to process those big datasets in parallel in one job? >>> 3. Any other solution to process multiple datasets in parallel? >>> >>> Thank you, >>> Amol Patil >>> >> >> -- Best Regards, Ayan Guha
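A minimal sketch of the "parallel datasets from one driver" pattern referred to above: FAIR scheduling plus a thread pool that submits independent read/write jobs. The paths and pool sizing are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

# FAIR scheduling lets concurrently submitted jobs share executors instead of queuing FIFO
spark = (SparkSession.builder
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())

datasets = ["hdfs:///landing/ds1", "hdfs:///landing/ds2"]   # hypothetical paths

def process(path):
    df = spark.read.parquet(path)
    df.write.mode("overwrite").parquet(path.replace("landing", "curated"))

# Each thread submits its own Spark jobs; the driver runs them concurrently
with ThreadPoolExecutor(max_workers=len(datasets)) as pool:
    list(pool.map(process, datasets))
```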

Re: Memory problems with simple ETL in Pyspark

2017-04-17 Thread ayan guha
audience part of the task finished successfully and the failure > was on a df that didn't touch it, it shouldn't've made a difference. > > Thank you! > > On Sat, Apr 15, 2017 at 9:07 PM, ayan guha <guha.a...@gmail.com> wrote: > >> What i missed is try increasing number

<    1   2   3   4   5   6   7   8   >