On Thu, 7 Dec 2017 at 11:37 am, ayan guha <guha.a...@gmail.com> wrote:
> You can use the get_json_object function
>
> On Thu, 7 Dec 2017 at 10:39 am, satyajit vegesna <
> satyajit.apas...@gmail.com> wrote:
>
>> Does Spark support automatic detection of schema from a JSON string?
Yes, my thought exactly. Kindly let me know if you need any help porting it to
PySpark.
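A minimal sketch of the schema-inference route (Scala; spark.read.json accepts a
Dataset[String] from Spark 2.2 onwards, or an RDD[String] on older releases; the
sample JSON strings below are made up):

    import org.apache.spark.sql.{Dataset, SparkSession}

    val spark = SparkSession.builder().appName("json-schema-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical JSON documents held as plain strings
    val jsonStrings: Dataset[String] = Seq(
      """{"id": 1, "name": "a", "tags": ["x", "y"]}""",
      """{"id": 2, "name": "b", "tags": ["z"]}"""
    ).toDS()

    // spark.read.json samples the strings and infers the schema automatically
    val df = spark.read.json(jsonStrings)
    df.printSchema()
    df.show()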
On Mon, Nov 6, 2017 at 8:54 AM, Nicolas Paris <nipari...@gmail.com> wrote:
> On 05 Nov 2017 at 22:46, ayan guha wrote:
> > Thank you for the clarification. That was my understanding too
n column is provided?
>
> No, in this case each worker sends its own JDBC call, as described in the
> documentation:
> https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
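A hedged sketch of what that looks like: with the partitioning options set, each
executor task issues its own JDBC query for one slice of the table. The URL,
table and column names below are placeholders.

    val jdbcDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("dbtable", "public.my_table")
      .option("user", "my_user")
      .option("password", "my_password")
      .option("partitionColumn", "id")   // a numeric column to split on
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")      // 8 parallel JDBC reads
      .load()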
>
> --
Best Regards,
Ayan Guha
---
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
--
Best Regards,
Ayan Guha
ight_id, "
> + "case when length(a.id)/2 = '8' then a.desc else ' ' end as
> level_eight_name, "
> + "case when length(a.id)/2 = '9' then a.id
> else ' ' end as level_nine_id, "
> + "case when length(a.id)/2 = '9' then a.desc else ' ' end as
> level_nine_name, "
> + "case when length(a.id)/2 = '10' then a.id else ' ' end as
> level_ten_id, "
> + "case when length(a.id)/2 = '10' then a.desc
> else ' ' end as level_ten_name "
> + "from CategoryTempTable a")
>
>
> Can someone also help me populate all the parent levels in the
> respective level ID and level name columns, please?
>
>
> Thanks,
> Aakash.
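Since the query above derives the level of a row as length(a.id)/2 (two characters
per level), one possible way to populate the parent levels on the same row is to
take prefixes of the id and self-join back for the names. This is only a sketch
under that assumption, shown here for levels eight and nine:

    val withParents = spark.sql("""
      SELECT a.id, a.desc,
             CASE WHEN length(a.id)/2 >= 8 THEN substring(a.id, 1, 16) ELSE ' ' END AS level_eight_id,
             CASE WHEN length(a.id)/2 >= 8 THEN coalesce(p8.desc, ' ')  ELSE ' ' END AS level_eight_name,
             CASE WHEN length(a.id)/2 >= 9 THEN substring(a.id, 1, 18) ELSE ' ' END AS level_nine_id,
             CASE WHEN length(a.id)/2 >= 9 THEN coalesce(p9.desc, ' ')  ELSE ' ' END AS level_nine_name
      FROM CategoryTempTable a
      LEFT JOIN CategoryTempTable p8 ON p8.id = substring(a.id, 1, 16)
      LEFT JOIN CategoryTempTable p9 ON p9.id = substring(a.id, 1, 18)
    """)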
>
>
>
--
Best Regards,
Ayan Guha
,
> Asmath
>
> Sent from my iPhone
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
Best Regards,
Ayan Guha
sible to map rows belonging to a unique combination inside an iterator?
>
> e.g
>
> col1 col2 col3
> a 1 a1
> a 1 a2
> b 2 b1
> b 2 b2
>
> how can I separate rows with col1 and col2 = (a,1) and (b,2)?
>
> regards,
> Imran
>
> --
> I.R
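A small sketch of two ways to keep each (col1, col2) combination together, using
the example data above (assumes import spark.implicits._ is in scope):

    import org.apache.spark.sql.functions._

    val df = Seq(("a", 1, "a1"), ("a", 1, "a2"), ("b", 2, "b1"), ("b", 2, "b2"))
      .toDF("col1", "col2", "col3")

    // Option 1: aggregate each combination's rows into a single row
    df.groupBy("col1", "col2").agg(collect_list("col3").as("col3s")).show()

    // Option 2: typed API, giving you an iterator per combination
    df.as[(String, Int, String)]
      .groupByKey(t => (t._1, t._2))
      .mapGroups((key, rows) => (key._1, key._2, rows.map(_._3).mkString(",")))
      .show()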
>
--
Best Regards,
Ayan Guha
daJson);
>> val updatelambdaReq:InvokeRequest = new InvokeRequest();
>> updatelambdaReq.setFunctionName(updateFunctionName);
>> updatelambdaReq.setPayload(updatedLambdaJson.toString());
>> System.out.println("Calling lambda to add log");
>
ds appended, time etc. Instead of one single entry in the database,
> multiple entries are being made to it. Is it because of parallel execution
> of code in the workers? If so, how can I solve it so that it writes only
> once?
>
> *Thanks!*
>
> *Cheers!*
>
> Harsh Choudhary
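If the cause is indeed that the write sits inside a transformation running once
per task, one hedged fix is to do the distributed work first and perform the
single side effect on the driver afterwards; processedDF, jobId and writeLogEntry
below are hypothetical placeholders for this thread's logic:

    val rowCount = processedDF.count()   // distributed action, runs on the executors
    writeLogEntry(jobId, rowCount)       // plain driver-side call, executed exactly once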
>
--
Best Regards,
Ayan Guha
>
>
>
--
Best Regards,
Ayan Guha
gt;> ABZ|ABZ|AF|3|2|730
>> ABZ|ABZ|AF|3|3|730
>> .
>> .
>> .
>> ABZ|ABZ|AF|Y|4|730
>> ABZ|ABZ|AF||Y|5|730
>>
>> Basically, I want to consider the various combinations of the 4th and 5th
>> columns (where the values are delimited by commas) and accordingly generate
>> the above rows from a single row. Could you please suggest a good way
>> of achieving this? Thanks in advance!
>>
>> Regards,
>>
>> Debu
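One possible approach, sketched with made-up column names since the file is
pipe-delimited without headers: split the comma-delimited 4th and 5th columns and
explode each one, producing one output row per combination.

    import org.apache.spark.sql.functions._

    val exploded = df
      .withColumn("c4", explode(split(col("c4_raw"), ",")))
      .withColumn("c5", explode(split(col("c5_raw"), ",")))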
>>
>
>
--
Best Regards,
Ayan Guha
> 13);
>
> Is there any other param that needs to be set as well?
>
> Thanks
>
> On Tue, Oct 10, 2017 at 4:32 AM, ayan guha <guha.a...@gmail.com> wrote:
>
>> I have not tested this, but you should be able to pass on any map-reduce
>> like conf to underlying
while reading from HDFS itself
> instead of using repartition() etc.,
> >
> > Any suggestions are helpful!
> >
> > Thanks
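Two knobs that influence partition counts at read time, rather than repartitioning
afterwards; the values below are illustrative only:

    // RDD reads: minPartitions hint
    val rdd = sc.textFile("hdfs:///data/input", 200)

    // DataFrame/Dataset file sources: smaller maxPartitionBytes => more partitions
    spark.conf.set("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024)
    val df = spark.read.parquet("hdfs:///data/input_parquet")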
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
--
Best Regards,
Ayan Guha
ribe e-mail: user-unsubscr...@spark.apache.org
>
>
> The jdbc source will load data into the driver node; this may slow down the
> job and may cause an OOM.
>
>
> -----
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
--
Best Regards,
Ayan Guha
ark says it finished the job and saved the data?
> 2. What can we do to ensure that we write data to ES in a consistent
> manner?
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
--
Best Regards,
Ayan Guha
to memory errors on very huge datasets.
>>
>>
>> Anybody knows or can point me to relevant documentation ?
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
> --
Best Regards,
Ayan Guha
; t.x == 1 && t.y == 2)
> >
> > Approach 2:
> > .filter(t -> t.x == 1)
> > .filter(t -> t.y == 2)
> >
> > Is there a difference or one is better than the other or both are same?
> >
> > Thanks!
> > Ahmed Mahmoud
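A quick way to check on your own data is to compare the plans: with Column-based
filters Catalyst merges consecutive filters into one predicate, and with typed
lambda filters both forms run in the same stage anyway. A sketch, assuming ds is a
Dataset of a case class with fields x and y:

    val a = ds.filter(t => t.x == 1 && t.y == 2)
    val b = ds.filter(t => t.x == 1).filter(t => t.y == 2)

    a.explain(true)
    b.explain(true)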
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
--
Best Regards,
Ayan Guha
ache.spark.sql.functions._
> val result = singleRowDF
> .withColumn("dummy", explode(array((1 until 100).map(lit): _*)))
> .selectExpr(singleRowDF.columns: _*)
>
> How can I create a column from an array of values in Java and pass it to
> explode function? Suggestions are helpful.
>
>
> Thanks
> Kanagha
>
--
Best Regards,
Ayan Guha
ot use parallelism.
>>
>
> This is not true, unless the file is small or is gzipped (gzipped files
> cannot be split).
>
--
Best Regards,
Ayan Guha
>
>
>
>
> I probably missed something super obvious, but can’t find it…
>
>
>
> Any help/hint is welcome! - TIA
>
>
>
> jg
>
>
>
>
>
>
>
--
Best Regards,
Ayan Guha
,
>> adding to that partitioning by date (in daily ETLs for instance) would
>> probably cause a data skew (right?), but why am I getting OOMs? Isn't Spark
>> supposed to spill to disk if the underlying RDD is too big to fit in memory?
>>
>> If I'm not using "partitionBy" with the writer (still exploding)
>> everything works fine.
>>
>> This happens both in EMR and in local (mac) pyspark/spark shell (tried
>> both in python and scala).
>>
>> Thanks!
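One commonly suggested mitigation (a sketch, assuming a partition column called
"date" and an already-exploded DataFrame explodedDF) is to align the in-memory
partitioning with the on-disk layout before writing, so each task keeps open
writers for only a few date values at a time:

    import org.apache.spark.sql.functions.col

    explodedDF
      .repartition(col("date"))
      .write
      .partitionBy("date")
      .parquet("/path/to/output")   // placeholder path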
>>
>>
>>
> --
Best Regards,
Ayan Guha
tried amazon elastic cache.Just give me some pointers. Thanks!
>
--
Best Regards,
Ayan Guha
t;>>>
>>>> #
>>>>
>>>> val spark = SparkSession.builder().appName("SampleJob").config("spark.master", "local").getOrCreate()
>>>>
>>>> val df = this is dataframe which has list of file names (hdfs)
>>>>
>>>> df.foreach { fileName =>
>>>>
>>>> *spark.read.json(fileName)*
>>>>
>>>> .. some logic here
>>>> }
>>>>
>>>> #
>>>>
>>>>
>>>> *spark.read.json(fileName) --- this fails as it runs in executor. When
>>>> I put it outside foreach, i.e. in driver, it works.*
>>>>
>>>> This is because I am trying to use the SparkSession on an executor, but it is
>>>> not visible outside the driver. I still want to read HDFS files inside foreach;
>>>> how do I do it?
>>>>
>>>> Can someone help how to do this.
>>>>
>>>> Thanks,
>>>> Chackra
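One hedged way around it: the SparkSession only exists on the driver, so collect
the (presumably small) list of paths there and hand them all to a single read. The
column name "fileName" is assumed from the thread.

    import spark.implicits._

    val fileNames: Array[String] = df.select("fileName").as[String].collect()
    val allJson = spark.read.json(fileNames: _*)
    // ... some logic here, now on one DataFrame covering every file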
>>>>
>>>
>>>
>>
>
--
Best Regards,
Ayan Guha
;).rdd.map(_.getString(0
>> ).split(",").map(_.trim replaceAll ("[\\[\\]\"]", "")).toList)
>> //val oneRow =
>> Converted(eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq("CCB")).toDF("event_name")).select("eventIndex").first().getDouble(0))
>>
>> val severalRows = rddX.map(row => {
>> // Split array into n tools
>> println("ROW: " + row(0).toString)
>> println(row(0).getClass)
>> println("PRINT: " +
>> eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row(0
>> ))).toDF("event_name")).select("eventIndex").first().getDouble(0))
>> (eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq
>> (row)).toDF("event_name")).select("eventIndex").first().getDouble(0), Seq
>> (row).toString)
>> })
>> // attempting
>>
>>
>> --
Best Regards,
Ayan Guha
> wrote:
>>
>>> toJSON on Dataset/DataFrame?
>>>
>>> --
>>> *From:* kant kodali <kanth...@gmail.com>
>>> *Sent:* Saturday, September 9, 2017 4:15:49 PM
>>> *To:* user @spark
>>> *Subject:* How to convert Row to JSON in Java?
>>>
>>> Hi All,
>>>
>>> How to convert Row to JSON in Java? It would be nice to have .toJson()
>>> method in the Row class.
>>>
>>> Thanks,
>>> kant
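For reference, toJSON is defined on Dataset, so the call is the same from Java
(df.toJSON()) as from Scala; a small sketch:

    // Each row becomes one JSON string, using the Dataset's schema for field names
    val jsonDS = df.toJSON
    jsonDS.show(false)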
>>>
>>
>>
>
--
Best Regards,
Ayan Guha
, LLC
> 913.938.6685
>
> www.massstreet.net
>
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData <http://twitter.com/BobLovesData>
>
>
>
--
Best Regards,
Ayan Guha
ation,
>
> usecase: i want to read data from hive one cluster and write to hive on
> another cluster
>
> Please suggest if this can be done?
>
>
>
--
Best Regards,
Ayan Guha
Any help on this?
On Thu, Aug 31, 2017 at 10:30 AM, ayan guha <guha.a...@gmail.com> wrote:
> Hi
>
> Want to understand a basic issue. Here is my code:
>
> def createStreamingContext(sparkCheckpointDir: String,batchDuration: Int
> ) = {
>
> val ssc = new Strea
;> ..that's the reason I'm planning to use mapGroups with a function argument
>>> that takes a row iterator, but I'm not sure this is the best way to implement
>>> it, as my initial DataFrame is very large
>>>
>>> On Tue, Aug 29, 2017 at 10:24 PM ayan guha <guha.a...@gm
29, 2017 at 7:50 PM, Burak Yavuz <brk...@gmail.com> wrote:
>
>> That just gives you the max time for each train. If I understood the
>> question correctly, OP wants the whole row with the max time. That's
>> generally solved through joins or subqueries, which would be har
>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I am wondering what is the easiest and concise way to express the
>>>>>> computation below in Spark Structured streaming given that it supports
>>>>>> both
>>>>>> imperative and declarative styles?
>>>>>> I am just trying to select rows that has max timestamp for each
>>>>>> train? Instead of doing some sort of nested queries like we normally do
>>>>>> in
>>>>>> any relational database I am trying to see if I can leverage both
>>>>>> imperative and declarative at the same time. If nested queries or join
>>>>>> are
>>>>>> not required then I would like to see how this can be possible? I am
>>>>>> using
>>>>>> spark 2.1.1.
>>>>>>
>>>>>> Dataset
>>>>>>
>>>>>> Train  Dest  Time
>>>>>> 1      HK    10:00
>>>>>> 1      SH    12:00
>>>>>> 1      SZ    14:00
>>>>>> 2      HK    13:00
>>>>>> 2      SH    09:00
>>>>>> 2      SZ    07:00
>>>>>>
>>>>>> The desired result should be:
>>>>>>
>>>>>> Train  Dest  Time
>>>>>> 1      SZ    14:00
>>>>>> 2      HK    13:00
>>>>>>
>>>>>>
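One sketch that avoids joins and subqueries and also works as a streaming
aggregation: pack (Time, Dest) into a struct and take the max per Train; structs
compare field by field, so the max struct carries the row with the latest time.
Column names follow the example above.

    import org.apache.spark.sql.functions._
    import spark.implicits._

    val latest = trains
      .groupBy($"Train")
      .agg(max(struct($"Time", $"Dest")).as("latest"))
      .select($"Train", $"latest.Dest".as("Dest"), $"latest.Time".as("Time"))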
>>>>>
>>>>
>>>
>>
>
--
Best Regards,
Ayan Guha
val empList = empData.toList
>>
>> // calculateIncome has logic to figure out the latest income for an account
>> // and returns that latest income
>> val income = calculateIncome(empList)
>>
>> for (i <- empList) {
>>   val row = i
>>   return new employee(row.EmployeeID, row.INCOMEAGE, income)
>> }
>> return "Done"
>>
>>
>>
>> }
>>
>>
>>
>> Is this a good approach, or even the right approach, to implement this?
>> If not, please suggest a better way to implement the same.
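A sketch of the same logic without returning from inside a for loop (which exits
on the first element): compute the latest income once, then map every row to a new
employee record. calculateIncome and employee are the names already used in this
thread.

    val empList = empData.toList
    val income  = calculateIncome(empList)
    val result  = empList.map(row => new employee(row.EmployeeID, row.INCOMEAGE, income))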
>>
>>
>>
>> --
>>
>> The information contained in this e-mail is confidential and/or
>> proprietary to Capital One and/or its affiliates and may only be used
>> solely in performance of work or services for Capital One. The information
>> transmitted herewith is intended only for use by the individual or entity
>> to which it is addressed. If the reader of this message is not the intended
>> recipient, you are hereby notified that any review, retransmission,
>> dissemination, distribution, copying or other use of, or taking of any
>> action in reliance upon this information is strictly prohibited. If you
>> have received this communication in error, please contact the sender and
>> delete the material from your computer.
>>
>
--
Best Regards,
Ayan Guha
his is equivalent to --packages in
> spark-submit.
>
> BTW you'd better ask livy question in u...@livy.incubator.apache.org.
>
> Thanks
> Jerry
>
> On Thu, Aug 24, 2017 at 8:11 AM, ayan guha <guha.a...@gmail.com> wrote:
>
>> Hi
>>
>> I have a python p
t not able to understand how/where
to configure the "packages" switch...Any help?
--
Best Regards,
Ayan Guha
ing table in MySQL
>> <https://stackoverflow.com/questions/34643200/spark-dataframes-upsert-to-postgres-table>
>> and
>> then perform the UPDATE query on the MySQL side.
>>
>>
>>
>> Ideally, I’d like to handle the update during the write operation. Has
>> anyone else encountered this limitation and have a better solution?
>>
>>
>>
>> Thank you,
>>
>>
>>
>> Jake
>>
>
> --
Best Regards,
Ayan Guha
gt;>> >>> Hi,
>>>> >>>
>>>> >>> I have written a Spark SQL job on Spark 2.0 using Scala. It is
>>>> just pulling the data from a Hive table, adding extra columns, removing
>>>> duplicates and then writing it back to Hive again.
>>>> >>>
>>>> >>> In the Spark UI, it is taking almost 40 minutes to write 400 GB of
>>>> data. Is there anything I can do to improve performance?
>>>> >>>
>>>> >>> Spark.sql.partitions is 2000 in my case with executor memory of
>>>> 16gb and dynamic allocation enabled.
>>>> >>>
>>>> >>> I am doing insert overwrite on partition by
>>>> >>> Da.write.mode(overwrite).insertinto(table)
>>>> >>>
>>>> >>> Any suggestions please ??
>>>> >>>
>>>> >>> Sent from my iPhone
>>>> >>>
>>>> -
>>>> >>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>> >>>
>>>>
>>>
>>>
>>
> --
Best Regards,
Ayan Guha
user to be able to read that path using the
> credentials fetched above.
>
> Any help is appreciated.
>
> Thanks,
> Imtiaz
>
--
Best Regards,
Ayan Guha
00.00|1483286400.
> My idea is like below:
> 1. Write a UDF
> to add the userid to the beginning of every string element
> of listForRule77.
> 2. Use
> val rdd2 = rdd1.map(x => List_udf(x)).flatMap()
> ; the result rdd2 may be what I need.
>
> My question: Are there any problems in my idea? Is there a better
> way to do this ?
>
>
>
>
>
> Thanks and best regards!
> San.Luo
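If the goal is just "prefix every element of listForRule77 with the userid and
flatten", a single flatMap may be enough; a sketch assuming rdd1 holds the user
ids as strings and listForRule77 is a local List[String]:

    val rdd2 = rdd1.flatMap(userid => listForRule77.map(rule => userid + rule))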
>
>
>
> --
Best Regards,
Ayan Guha
ent on 3
> node spark cluster?
>
> Get Outlook for Android <https://aka.ms/ghei36>
>
>
--
Best Regards,
Ayan Guha
st.
> Log aggregation has not completed or is not enabled.
>
> Any other way to see my logs?
>
> Thanks
>
> John
>
>
>
>
> --
> *From:* ayan guha <guha.a...@gmail.com>
> *Sent:* Sunday, July 30, 2017 10:34 PM
> *To:* John Zen
;> I only see the first line which is outside of the 'mapToPair'. I actually
>> have verified my 'mapToPair' is called and the statements after the second
>> logging line were executed. The only issue for me is why the second
>> logging
>> is not in JobTracker UI.
>>
>> Appreciate your help.
>>
>> Thanks
>>
>> John
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Logging-in-RDD-mapToPair-of-Java-Spark-application-tp29007.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
> --
Best Regards,
Ayan Guha
: long (nullable = true)
|-- START_DATE: string (nullable = true)
Is there any way to do it?
I have control over whether to use an OrderedDict vs a normal dict, but the column
order is the requirement. Any help would be great!
--
Best Regards,
Ayan Guha
rkSession.builder().enableHiveSupport().getOrCreate()
>>
>> *spark.sqlContext.setConf("hive.exec.max.dynamic.partitions", "2000")*
>>
>> Please help with alternate workaround !
>>
>> Thanks
>>
>>
> --
Best Regards,
Ayan Guha
d1 -> fie...|
| 3|Map(field1 -> fie...|
| 4|Map(field1 -> fie...|
+---++
Thanks for creating such useful functions, spark devs :)
Best
Ayan
On Thu, Jul 27, 2017 at 2:30 PM, ayan guha <guha.a...@gmail.com> wrote:
> Hi
>
> I want to create a s
n is). Is
there any simpler way to achieve this? Probably using the withColumn API?
I saw a similar post here
<https://stackoverflow.com/questions/44223751/how-to-add-empty-map-type-column-to-dataframe>, but
I was not able to use the trick Jacek suggested.
--
Best Regards,
Ayan Guha
.ForeachWriter>
>>> is
>>> supported in Scala. BUT this doesn't seem like an efficient approach and
>>> adds deployment overhead because now I will have to support Scala in my app.
>>>
>>> Another approach is obviously to use Scala instead of python, which is
>>> fine but I want to make sure that I absolutely cannot use python for this
>>> problem before I take this path.
>>>
>>> Would appreciate some feedback and alternative design approaches for
>>> this problem.
>>>
>>> Thanks.
>>>
>>>
>>>
>>>
>>
> --
Best Regards,
Ayan Guha
try" FROM
> (SELECT * FROM dfs.root.`output.parquet`) AS Customers ) LIMIT 0
>
> Now, the Encountered quote is at "CustomerID" in the query.
>
> I tried to run the following query in Drill shell:
>
> SELECT "CustomerID" from dfs.root.`output.parquet`;
>
> It gives the same error of 'Encountered "\"" '.
>
> I want to ask if there is any way to remove the above "SELECT
> "CustomerID","First_name","Last_name","Email","Gender","Country" FROM" from
> the above query formulated by Spark and pushed down to Apache Drill via
> JDBC driver.
>
> Or any other workaround, like removing the quotes?
>
>
> Thanks,
>
> Luqman
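One possible workaround on the Spark side is to register a JDBC dialect that
quotes identifiers with backticks instead of the default double quotes; whether
Drill accepts backticks and the exact URL prefix are assumptions here:

    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

    object DrillDialect extends JdbcDialect {
      // Apply this dialect only to Drill JDBC URLs (assumed prefix)
      override def canHandle(url: String): Boolean = url.toLowerCase.startsWith("jdbc:drill")
      // Quote column names with backticks instead of double quotes
      override def quoteIdentifier(colName: String): String = s"`$colName`"
    }

    JdbcDialects.registerDialect(DrillDialect)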
>
--
Best Regards,
Ayan Guha
ike that? Then also have query
> engine respect it
>
> Thanks,
>
> Nishit
>
--
Best Regards,
Ayan Guha
icallocation=true but it got only one
>>>> executor.
>>>>
>>>> Normally this situation occurs when one of the jobs runs with
>>>> scheduler mode FIFO.
>>>>
>>>> 1) Have you ever faced this issue, and if so, how did you overcome it?
>>>>
>>>> I was under the impression that as soon as I submit job B, the Spark
>>>> scheduler should release a few resources from job A and share
>>>> them with job B in a round-robin fashion.
>>>>
>>>> Appreciate your response !!!.
>>>>
>>>>
>>>> Thanks & Regards,
>>>> Gokula Krishnan* (Gokul)*
>>>>
>>>
>>>
>>
>
--
Best Regards,
Ayan Guha
2017 at 7:42 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> You should download a pre-built version. The code you have got is source
>> code; you need to build it to generate the jar files.
>>
>>
> Hi Ayan,
>
> Can you please help me understand to build
o https://spark.apache.org/downloads.html.
>
> Any help will be highly appreciated. Thanks in advance.
>
> Regards,
>
> Kaushal
>
--
Best Regards,
Ayan Guha
;)
> .master("local[4]")
> .getOrCreate();
>
> final Properties connectionProperties = new Properties();
> connectionProperties.put("user", *"some_user"*));
> connectionProperties.put("password", "some_pwd"));
>
>
rame --- join / groupBy-agg
> question...
> <http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-join-groupBy-agg-question-tp28849p28880.html>
>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>
--
Best Regards,
Ayan Guha
Hi
Anyone here any exp in integrating spark with azure keyvault?
--
Best Regards,
Ayan Guha
>
>
>
> On 17 July 2017 at 19:34, ayan guha <guha.a...@gmail.com> wrote:
>
>
schema.substring(0,schema.length-1)
> val sqlSchema = StructType(schema.split(",").map(s=>StructField(s,
> StringType,false)))
> sqlContext.createDataFrame(newDataSet,sqlSchema).show()
>
> Regards
> Pralabh Kumar
>
>
> On Mon, Jul 17, 2017 at 1:55 PM, n
to
> exist
> >> > in
> >> > the same directory on each executor node?
> >> >
> >> > cheers,
> >> >
> >> >
> >> >
> >> > Dr Mich Talebzadeh
> >> >
> >> >
> &g
> >
> >
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
Best Regards,
Ayan Guha
quot;,"\\^")).select($"phone".getItem(0).as("phone1"),$"phone".getItem(1).as("phone2”))
> I though of doing this way but the problem is column are having 100+
> separator between the column values
>
>
>
> Thank you,
> Nayan
>
--
Best Regards,
Ayan Guha
gain for your time :)
>
> On Sat, Jul 8, 2017 at 8:06 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> Hi
>>
>> Mostly from SO to find overlapping time, adapted to Spark
>>
>>
>> %pyspark
>> from pyspark.sql import Row
>> d = [Row(t="2017-0
;
> On Thu, Jul 6, 2017 at 5:28 PM, kant kodali <kanth...@gmail.com> wrote:
>
>> HI All,
>>
>> I am wondering: if I pass a raw SQL string to a DataFrame, do I still get the
>> Spark SQL optimizations? Why or why not?
>>
>> Thanks!
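For what it's worth, a raw SQL string and the equivalent DataFrame code both go
through the same Catalyst optimizer, which is easy to verify by comparing the
plans. A sketch, where "people" is a hypothetical temp view:

    import spark.implicits._

    val viaSql = spark.sql("SELECT name FROM people WHERE age > 21")
    val viaApi = spark.table("people").where($"age" > 21).select($"name")

    viaSql.explain(true)   // optimized and physical plans
    viaApi.explain(true)   // should look essentially the same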
>>
>
>
--
Best Regards,
Ayan Guha
le = sc.cassandraTable("test", "ttable")
> println(ttable.count)
>
> Some help to join both things, please. Scala or Python code is fine for me.
> Thanks in advance.
> Cheers.
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
--
Best Regards,
Ayan Guha
, Saatvik Shah <saatvikshah1...@gmail.com>
wrote:
> Hey Ayan,
>
> This isn't a typical text file - it's a proprietary data format for which a
> native Spark reader is not available.
>
> Thanks and Regards,
> Saatvik Shah
>
> On Thu, Jun 29, 2017 at 6:48 PM, ayan
sage in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-working-with-Generators-tp28810.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
Best Regards,
Ayan Guha
rk.xml").option(
> "rowTag","book").load("books.xml")
>
> val xData = dfX.registerTempTable("books")
>
> dfX.printSchema()
>
> val books_inexp =sqlContext.sql("select title,author from books where
> price<10")
>
> books_inexp.show
>
>
>
> Regards,
>
> Amol
> This message contains information that may be privileged or confidential
> and is the property of the Capgemini Group. It is intended only for the
> person to whom it is addressed. If you are not the intended recipient, you
> are not authorized to read, print, retain, copy, disseminate, distribute,
> or use this message or any part thereof. If you receive this message in
> error, please notify the sender immediately and delete all copies of this
> message.
>
--
Best Regards,
Ayan Guha
execution.datasources.DataSource.sourceSchema(DataSource.scala:192)
>>>
>>> at
>>> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87)
>>>
>>> at
>>> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87)
>>>
>>> at
>>> org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
>>>
>>> at
>>> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:150)
>>>
>>> ... 48 elided
>>>
>>> Caused by: java.lang.ClassNotFoundException:
>>> org.apache.kafka.common.serialization.ByteArrayDeserializer
>>>
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>
>>> ... 57 more
>>>
>>>
>>> ++
>>>
>>> i have tried building the jar with dependencies, but still face the same
>>> error.
>>>
>>> But when i try to do --package with spark-shell using bin/spark-shell
>>> --package org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 , it works
>>> fine.
>>>
>>> The reason I am trying to build from source code is that
>>> I want to try pushing DataFrame data into a Kafka topic, based on
>>> https://github.com/apache/spark/commit/b0a5cd89097c563e9949d8cfcf84d18b03b8d24c,
>>> which doesn't work with version 2.1.0.
>>>
>>>
>>> Any help would be highly appreciated.
>>>
>>>
>>> Regards,
>>>
>>> Satyajit.
>>>
>>>
>>>
>>
> --
Best Regards,
Ayan Guha
HBase writer? Or any other way to write continuous data to HBase?
best
Ayan
On Tue, Jun 27, 2017 at 2:15 AM, Weiqing Yang <yangweiqing...@gmail.com>
wrote:
> For SHC documentation, please refer the README in SHC github, which is
> kept up-to-date.
>
> On Mon, Jun 26, 2017 at
wrote:
> Hi,
> I recently switched from scala to python, and wondered which IDE people
> are using for python. I heard about pycharm, spyder etc. How do they
> compare with each other?
>
> Thanks,
> Shawn
>
--
Best Regards,
Ayan Guha
ability to configure closure serializer
>- HTTPBroadcast
>- TTL-based metadata cleaning
>    - Semi-private class org.apache.spark.Logging. We suggest you use
>    slf4j directly.
>- SparkContext.metricsSystem
>
> Thanks,
>
> Mahesh
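A minimal sketch of the suggested replacement for the removed
org.apache.spark.Logging trait, using slf4j directly (the class name is
hypothetical):

    import org.slf4j.LoggerFactory

    class MyJob extends Serializable {
      // @transient + lazy so the logger is recreated on executors instead of serialized
      @transient private lazy val log = LoggerFactory.getLogger(getClass)

      def run(): Unit = {
        log.info("starting job")
      }
    }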
>
>
>
>
>
&g
f the spark streaming application?
>
>
> Any help would be appreciated.
>
> Thanks,
> Naveen
>
>
>
--
Best Regards,
Ayan Guha
wrote:
> Yes.
> What SHC version you were using?
> If hitting any issues, you can post them in SHC github issues. There are
> some threads about this.
>
> On Fri, Jun 23, 2017 at 5:46 AM, ayan guha <guha.a...@gmail.com> wrote:
>
>> Hi
>>
>> Is it possible t
st Regards,
Ayan Guha
-of-parallelism and
> its spark.default.parallelism setting
>
> Best,
>
> On Mon, Jun 12, 2017 at 6:18 AM, ayan guha <guha.a...@gmail.com> wrote:
>
>> I understand how it works with HDFS. My question is, when HDFS is not the
>> file system, how the number of parti
name).
>>>
>>> In short, it is a high level SQL-like DSL (Domain Specific Language) on
>>> top of Spark. People can use that DSL to write Spark jobs without worrying
>>> about Spark internal details. Please check README
>>> <https://github.com/uber/uberscriptquery> in the project to get more
>>> details.
>>>
>>> It will be great if I could get any feedback or suggestions!
>>>
>>> Best,
>>> Bo
>>>
>>>
>
>
--
Best Regards,
Ayan Guha
he schema in Hive is different from the one in the parquet file.
>> But this is a very strange case, as the same schema works fine for other
>> brands, which defined as a partition column, and share the whole Hive
>> schema as the above.
>>
>> If I query like: "select * from tablename where brand='*BrandB*' limit
>> 3:", everything works fine.
>>
>> So is this really caused by the Hive schema mismatching the parquet file
>> generated by Spark, or by the data within different partition keys, or
>> really a compatibility issue between Spark and Hive?
>>
>> Thanks
>>
>> Yong
>>
>>
>>
--
Best Regards,
Ayan Guha
s/29011574/how-does-partitioning-work-for-data-from-files-on-hdfs
>
>
>
> Regards,
> Vaquar khan
>
>
> On Jun 11, 2017 5:28 AM, "ayan guha" <guha.a...@gmail.com> wrote:
>
>> Hi
>>
>> My question is what happens if I have 1 file of say 100g
> code then you will get 12 partitions.
>
> r = sc.textFile("file://my/file/*")
>
> Not sure what you want to know about the file system; please check the API doc.
>
>
> Regards,
> Vaquar khan
>
>
> On Jun 8, 2017 10:44 AM, "ayan guha" <guha.a...@g
Anyone?
On Thu, 8 Jun 2017 at 3:26 pm, ayan guha <guha.a...@gmail.com> wrote:
> Hi Guys
>
> Quick one: how does Spark deal with (i.e. create partitions for) large files sitting
> on NFS, assuming all executors can see the file in exactly the same way?
>
> ie, when I run
>
> r
t;>> .option("driver", "org.apache.hive.jdbc.HiveDriver")
>>>> .format("jdbc")
>>>> .load()
>>>> test.show()
>>>>
>>>>
>>>> Scala version: 2.11
>>>> Spark version: 2.1.0, i also tried 2.1.1
>>>> Hive version: CDH 5.7 Hive 1.1.1
>>>> Hive JDBC version: 1.1.1
>>>>
>>>> But this problem available on Hive with later versions, too.
>>>> I didn't find anything in the mailing list answers or on StackOverflow.
>>>> Could you please help me with this issue, or help me find the correct
>>>> solution for querying a remote Hive from Spark?
>>>>
>>>> Thanks in advance!
>>>>
>>>
>>>
>> --
Best Regards,
Ayan Guha
File("hdfs://my/file")
Are the input formats used same in both cases?
--
Best Regards,
Ayan Guha
erent
>> from Edge node for Hadoop please?
>>
>> Thanks
>>
>> -- -- -
>> To unsubscribe e-mail: user-unsubscribe@spark.apache. org
>> <user-unsubscr...@spark.apache.org>
>>
>>
>>
>>
>>
> --
Best Regards,
Ayan Guha
(NumberFormatException.java:65)
> at java.lang.Long.parseLong(Long.java:589)
> at java.lang.Long.parseLong(Long.java:631)
> at
> scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
> at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
> at
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:42)
> at
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
> ... 56 elided
>
>
> Any ideas how this can work!
>
> Thanks
>
>
>
>
>
>
>
>
>
> --
Best Regards,
Ayan Guha
numOfSlices)
>>>> dumpFilesRDD.foreachPartition(dumpFilePath->parse(dumpFilePath))
>>>> .
>>>> .
>>>> .
>>>>
>>>> In parse(), each dump file is parsed and inserted into database using
>>>> SparlSQL. In order to do that, SparkContext is needed in the function parse
>>>> to use the sql() method.
>>>>
>>>
>>>
>
--
Best Regards,
Ayan Guha
uot;native data source"? I couldn't find any mention of this feature in
>> the SQL Programming Guide and Google was not helpful.
>>
>> --
>> Daniel Siegmann
>> Senior Software Engineer
>> *SecurityScorecard Inc.*
>> 214 W 29th Street, 5th Floor
>> New York, NY 10001
>>
>> --
Best Regards,
Ayan Guha
e ask.
> Any clues on why this is happening would be very helpful; I've been stuck on
> this for two days now. Thank you for your time.
>
>
> ***versions***
>
> Zeppelin: 0.7.1
> Spark: 2.1.0
> Cassandra: 2.2.9
> Connector: datastax:spark-cassandra-connector:2.0.1-s_2.11
>
> *Spark cluster specs*
>
> 6 vCPUs, 32 GB memory = 1 node
>
> *Cassandra + Zeppelin server specs*
> 8 vCPUs, 52 GB memory
>
> --
Best Regards,
Ayan Guha
er_id1 feature100
>
> I want to get the result as follows:
> user_id1 feature1 feature2 feature3 feature4 feature5...feature100
>
> Is there a more efficient way than a join?
>
> Thanks!
>
> Didac Gil de la Iglesia
> PhD in Computer Science
> didacg...@gmail.com
> Spain: +34 696 285 544
> Sweden: +46 (0)730229737
> Skype: didac.gil.de.la.iglesia
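If the input is one row per (user, feature), a pivot is often the simpler
alternative to ~100 self-joins; a sketch with assumed column names:

    import org.apache.spark.sql.functions.first

    val wide = longDF
      .groupBy("user_id")
      .pivot("feature_name")
      .agg(first("feature_value"))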
>
> --
Best Regards,
Ayan Guha
spark/issues. HTH!
>
> On Thu, May 11, 2017 at 9:08 PM ayan guha <guha.a...@gmail.com> wrote:
>
>> Works for me too... you are a life-saver :)
>>
>> But the question: should/how we report this to Azure team?
>>
>> On Fri, May 12, 2017 at 10:32 AM, Denny L
.__/\_,_/_/ /_/\_\ version 2.0.2.2.5.4.0-121
> /_/
>
> Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_121)
> Type in expressions to have them evaluated.
> Type :help for more information.
>
> scala>
>
> HTH!
>
>
> On Wed, May 10,
tDB connector,
> specifically version 0.0.1. Could you run the 0.0.3 version of the jar and
> see if you're still getting the same error? i.e.
>
> spark-shell --master yarn --jars azure-documentdb-spark-0.0.3-
> SNAPSHOT.jar,azure-documentdb-1.10.0.jar
>
>
> On Mon, May 8,
d:~/azure-spark-docdb-test/v1$ uname -a
Linux ed0-svochd 4.4.0-72-generic #93-Ubuntu SMP Fri Mar 31 14:07:41 UTC
2017 x86_64 x86_64 x86_64 GNU/Linux
sshuser@ed0-svochd:~/azure-spark-docdb-test/v1$
--
Best Regards,
Ayan Guha
<zemin...@gmail.com> wrote:
>
>> I'm trying to decide whether to buy the book Learning Spark, Spark for
>> Machine Learning etc., or wait for a new edition covering the new concepts
>> like DataFrames and Datasets. Has anyone got any suggestions?
>>
>
>
--
Best Regards,
Ayan Guha
/Azure/azure-documentdb-spark) if sticking
> with Scala?
> On Thu, Apr 20, 2017 at 21:46 Nan Zhu <zhunanmcg...@gmail.com> wrote:
>
>> DocDB does have a java client? Anything prevent you using that?
>>
>> Get Outlook for iOS <https://aka.ms/o0ukef>
>> --
hdinsight/spark-eventhubs, which is
> the Event Hub receiver for Spark Streaming.
> We are using it, but I guess it is available only in Scala.
>
>
> Thanks,
> Ashish Singh
>
> On Fri, Apr 21, 2017 at 9:19 AM, ayan guha <guha.a...@gmail.com> wrote:
>
>> [image: Boxbe] <
Hi
I am not able to find any connector that can be used to connect Spark Streaming
with Azure Event Hub using PySpark.
Does anyone know whether such a library/package exists?
--
Best Regards,
Ayan Guha
st("m_123","m_111","m_145"):_*))
> count =0
> but
>
> df.filter($"msrid" isin (List("m_123"):_*))
> count=121212
>
> Any suggestion will do a great help to me.
>
> Thanks,
> Nayan
>
--
Best Regards,
Ayan Guha
; .filter(checkID($"ID"))
> > .select($"ID", $"BINARY")
> > .write...
> >
> > Thanks for any advice!
> >
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-store-10M-records-in-HDFS-to-speed-up-further-filtering-tp28605.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>
--
Best Regards,
Ayan Guha
;
>>> >
>>> > --
>>> > View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Shall-I-use-Apache-Zeppelin-for-data-analytics-visualization-tp28604.html
>>> > Sent from the Apache Spark User List mailing list archive at
>>> Nabble.com.
>>> >
>>> > -
>>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>> >
>>>
>>
>>
> --
Best Regards,
Ayan Guha
; please notify the sender immediately and delete this e-mail and any
> attached files at once. Unauthorized copying, use or opening of any
> attached files, as well as unauthorized forwarding of this e-mail, is
> not permitted
>
--
Best Regards,
Ayan Guha
>> I tried below things
>>> 1. I'm creating the DataFrame name & temporary table name dynamically based
>>> on the dataset name.
>>> 2. Enabled Spark Dynamic allocation (--conf
>>> spark.dynamicAllocation.enabled=true)
>>> 3. Set spark.scheduler.mode to FAIR
>>>
>>>
>>> I appreciate advise on
>>> 1. Is anything wrong in above implementation?
>>> 2. Is it good idea to process those big datasets in parallel in one job?
>>> 3. Any other solution to process multiple datasets in parallel?
>>>
>>> Thank you,
>>> Amol Patil
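On point 3, one pattern that pairs well with the FAIR scheduler is to submit each
dataset's pipeline from its own driver-side thread; a sketch, where datasetNames
and processDataset are hypothetical wrappers around the per-dataset
read/transform/write logic:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    val futures = datasetNames.map { name =>
      Future { processDataset(name) }   // each pipeline submits its own Spark jobs
    }
    Await.result(Future.sequence(futures), Duration.Inf)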
>>>
>>
>> --
Best Regards,
Ayan Guha
audience part of the task finished successfully and the failure
> was on a df that didn't touch it, it shouldn't have made a difference.
>
> Thank you!
>
> On Sat, Apr 15, 2017 at 9:07 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> What i missed is try increasing number