Re: spark multi tenancy

2015-10-07 Thread ayan guha
Can queues also be used to separate workloads? On 7 Oct 2015 20:34, "Steve Loughran" wrote: > > > On 7 Oct 2015, at 09:26, Dominik Fries > wrote: > > > > Hello Folks, > > > > We want to deploy several spark projects and want to use a unique

Re: Does feature parity exist between Spark and PySpark

2015-10-06 Thread ayan guha
il: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread ayan guha
Do you have a benchmark showing that running these two statements as they are will be slower than what you suggest? On 9 Jul 2015 01:06, Brandon White bwwintheho...@gmail.com wrote: The point of running them in parallel would be faster creation of the tables. Has anybody been able to efficiently

Re: [SparkSQL] Incorrect ROLLUP results

2015-07-09 Thread ayan guha
Can you please post the result of show()? On 10 Jul 2015 01:00, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, I just re-wrote a query from using UNION ALL to use with rollup and I'm seeing some unexpected behavior. I'll open a JIRA if needed but wanted to check if this is user error. Here

Re: Data Processing speed SQL Vs SPARK

2015-07-09 Thread ayan guha
). Is there is any scaling is there to decide what technology is best?either SQL or SPARK? On Thu, Jul 9, 2015 at 9:40 AM, ayan guha guha.a...@gmail.com wrote: It depends on workload. How much data you would want to process? On 9 Jul 2015 22:28, vinod kumar vinodsachin...@gmail.com wrote: Hi Everyone

Re: Ordering of Batches in Spark streaming

2015-07-10 Thread ayan guha
ordering within batches* .But i doubt is there any change from old spark versions to spark 1.4 in this context. Any Comments please !! -- Thanks Regards, Anshu Shukla -- Best Regards, Ayan Guha

Re: Is it possible to change the default port number 7077 for spark?

2015-07-10 Thread ayan guha
SSH by default should be on port 22. 7456 is the port where the master is listening, so any Spark app should be able to connect to the master using that port. On 11 Jul 2015 13:50, ashishdutt ashish.du...@gmail.com wrote: Hello all, In my lab a colleague installed and configured spark 1.3.0 on a 4

Re: How to implement top() and filter() on object List for JavaRDD

2015-07-07 Thread ayan guha
For additional commands, e-mail: user-h...@spark.apache.org -- Best Regards, Ayan Guha

Re: How to submit streaming application and exit

2015-07-07 Thread ayan guha
Wang wbi...@gmail.com wrote: I'm writing a streaming application and want to use spark-submit to submit it to a YARN cluster. I'd like to submit it in a client node and exit spark-submit after the application is running. Is it possible? -- Best Regards, Ayan Guha

Re: How to verify that the worker is connected to master in CDH5.4

2015-07-07 Thread ayan guha
to the spark history server. When I run spark-shell master ip: port number I get the following output How can I verify that the worker is connected to the master? Thanks, Ashish -- Best Regards, Ayan Guha

Re: Using Hive UDF in spark

2015-07-08 Thread ayan guha
it using sqlContext.udf.register,but when I restarted a service the UDF was not available. I've heared that Hive UDF's are permanently stored in hive.(Please Correct me if I am wrong). Thanks, Vinod -- Best Regards, Ayan Guha

Re: databases currently supported by Spark SQL JDBC

2015-07-09 Thread ayan guha
are the databases currently supported by Spark JDBC relation provider? rgds -- Niranda @n1r44 https://twitter.com/N1R44 https://pythagoreanscript.wordpress.com/ -- Best Regards, Ayan Guha

Re: Data Processing speed SQL Vs SPARK

2015-07-09 Thread ayan guha
It depends on the workload. How much data would you want to process? On 9 Jul 2015 22:28, vinod kumar vinodsachin...@gmail.com wrote: Hi Everyone, I am new to spark. Am using SQL in my application to handle data in my application.I have a thought to move to spark now. Is data processing speed

Re: SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread ayan guha
a 'describe table' from SparkSQL CLI it seems to try looking at all records at the table (which takes a really long time for big table) instead of just giving me the metadata of the table. Would appreciate if someone can give me some pointers, thanks! -- Best Regards, Ayan Guha

Re: Does spark supports the Hive function posexplode function?

2015-07-12 Thread ayan guha
) at org.apache.spark.sql.hive.HiveGenericUdtf.function(hiveUdfs.scala:274) Does spark support this Hive function posexplode? If not, how to patch it to support this? I am on Spark 1.3.1 Thanks, Jeff Li -- Best Regards, Ayan Guha

Re: Hbase Lookup

2015-09-03 Thread ayan guha
gt;> implements similar logic in your Pig UDF. >>> >>> Both approaches look similar. >>> >>> Personally, I would go with Spark solution, it will be slightly faster, >>> and easier if you already have Spark cluster setup on top of your hadoop >

Re: Hbase Lookup

2015-09-02 Thread ayan guha
elevancy scores. > > > You can use also Spark and Pig there. However, I am not sure if Spark is > suitable for these one row lookups. Same holds for Pig. > > > Le mer. 2 sept. 2015 à 23:53, ayan guha <guha.a...@gmail.com> a écrit : > > Hello group > > I am t

Hbase Lookup

2015-09-02 Thread ayan guha
Hello group I am trying to use pig or spark in order to achieve following: 1. Write a batch process which will read from a file 2. Lookup hbase to see if the record exists. If so then need to compare incoming values with hbase and update fields which do not match. Else create a new record. My

Re: Spark - launchng job for each action

2015-09-06 Thread ayan guha
; println("Count is"+count) >> println("First is"+firstElement) >> >> Now, rdd2.count launches job0 with 1 task and rdd2.first launches job1 >> with 1 task. Here in job2, when calculating rdd.first, is the entire >> lineage computed again or else as job0 already computes rdd2, is it reused >> ??? >> >> Thanks, >> Padma Ch >> >> > > > > -- > Best Regards > > Jeff Zhang > -- Best Regards, Ayan Guha

Re: Relational Log Data

2015-09-15 Thread ayan guha
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Spark Streaming Suggestion

2015-09-15 Thread ayan guha
I think you need to make up your mind about storm vs spark. Using both in this context does not make much sense to me. On 15 Sep 2015 22:54, "David Morales" wrote: > Hi there, > > This is exactly our goal in Stratio Sparkta, a real-time aggregation > engine fully developed

Re: Directly reading data from S3 to EC2 with PySpark

2015-09-15 Thread ayan guha
Also, you can set the Hadoop conf through the jsc.hadoopConf property. Do a dir(sc) to see the exact property name. On 15 Sep 2015 22:43, "Gourav Sengupta" wrote: > Hi, > > If you start your EC2 nodes with correct roles (default in most cases > depending on your needs) you should
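
A minimal PySpark sketch of this approach, setting S3 credentials on the underlying Hadoop configuration before reading (the bucket name and keys below are placeholders, not from the thread):

    # assumes an existing SparkContext `sc`
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")      # placeholder
    hadoop_conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")  # placeholder
    rdd = sc.textFile("s3n://your-bucket/path/to/data")              # hypothetical bucket/path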

Re: HDFS small file generation problem

2015-09-27 Thread ayan guha
y) by adding on the fly my > event ? > > Tks a lot > Nicolas > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Py4j issue with Python Kafka Module

2015-09-23 Thread ayan guha
..@gmail.com> > wrote: > >> I think it is something related to class loader, the behavior is >> different for classpath and --jars. If you want to know the details I think >> you'd better dig out some source code. >> >> Thanks >> Jerry >> &

Re: Spark Ingestion into Relational DB

2015-09-21 Thread ayan guha
- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Py4j issue with Python Kafka Module

2015-09-22 Thread ayan guha
gContext, class java.util.HashMap, class >> java.util.HashSet, >> class java.util.HashMap]) does not exist >> at >> py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) >> >> at >> py4j.reflection.Reflection

Py4j issue with Python Kafka Module

2015-09-22 Thread ayan guha
py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Unknown Source) >>> Am I doing something wrong? -- Best Regards, Ayan Guha

Re: Join over many small files

2015-09-23 Thread ayan guha
ficient query over such dataframe? > > > > Any advice will be appreciated. > > > > Best regards, > > Lucas > > > > == > Please access the attached hyperlink for an important electronic > communications disclaimer: > http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html > > == > -- Best Regards, Ayan Guha

Re: How to get the list of available Transformations and actions for a RDD in Spark-Shell

2015-12-04 Thread ayan guha
, I am using spark 1.3.0 with CDH 5.4 > > [image: Inline image 1] > > > > Thanks > Gokul > > -- Best Regards, Ayan Guha

Re: Experiences about NoSQL databases with Spark

2015-12-06 Thread ayan guha
age in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/Experiences-about-NoSQL-databases-with-Spark-tp25462p25594.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> ----- >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> > -- Best Regards, Ayan Guha

Re: how create hbase connect?

2015-12-07 Thread ayan guha
base on Rdd? > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Merge rows into csv

2015-12-08 Thread ayan guha
-- > ID STATE > - > 1 TX > 1NY > 1FL > 2CA > 2OH > - > > This is the required output: > - > IDCSV_STATE > - > 1 TX,NY,FL > 2 CA,OH > - > -- Best Regards, Ayan Guha
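
A minimal sketch of one way to produce that CSV_STATE output in PySpark (assumes a DataFrame `df` with columns id and state, and Spark 1.6+ where collect_list is available in pyspark.sql.functions; earlier versions needed a HiveContext):

    from pyspark.sql import functions as F

    result = (df.groupBy("id")
                .agg(F.concat_ws(",", F.collect_list("state")).alias("csv_state")))
    result.show()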

Re: sparkavro for PySpark 1.3

2015-12-05 Thread ayan guha
Loader.loadClass(ClassLoader.java:425) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > ... 26 more > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/sparkavro-for-PySpark-1-3-tp25561p25574.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Low Latency SQL query

2015-12-01 Thread ayan guha
You can try query push-down by specifying the query when creating the RDD. On 2 Dec 2015 12:32, "Fengdong Yu" wrote: > It depends on many situations: > > 1) what’s your data format? csv(text) or ORC/parquet? > 2) Did you have Data warehouse to summary/cluster your
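
A minimal PySpark sketch of the push-down idea — the query is wrapped as a derived table in the dbtable option so the database executes it before Spark sees the rows (assumes a SQLContext named sqlContext; URL, table and columns are hypothetical):

    pushed = (sqlContext.read.format("jdbc")
              .option("url", "jdbc:postgresql://dbhost:5432/mydb")                     # placeholder URL
              .option("dbtable", "(SELECT id, amount FROM sales WHERE dt = '2015-12-01') t")
              .option("driver", "org.postgresql.Driver")
              .load())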

RE: How to create dataframe from SQL Server SQL query

2015-12-07 Thread ayan guha
One more thing: for better maintainability, I would create a DB view and then use the view in Spark. This avoids burying complicated SQL queries within application code. On 8 Dec 2015 05:55, "Wang, Ningjun (LNG-NPV)" wrote: > This is a very helpful article.
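
A minimal sketch of what that looks like from the Spark side once the view exists in the database (the connection details and the view name dbo.v_orders are assumptions):

    df = (sqlContext.read.format("jdbc")
          .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=mydb")          # placeholder
          .option("dbtable", "dbo.v_orders")                                        # hypothetical view
          .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
          .load())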

Re: copy/mv hdfs file to another directory by spark program

2016-01-04 Thread ayan guha
name would keep unchanged. > Just need finish it in spark program, but not hdfs commands. > Is there any codes, it seems not to be done by searching spark doc ... > > Thanks in advance! > -- Best Regards, Ayan Guha

Re: pyspark dataframe: row with a minimum value of a column for each group

2016-01-05 Thread ayan guha
frame: > > a | b | c > -- > 1 | 1 | 1 > 2 | 1 | 4 > 3 | 1 | 7 > -- > > The dataframe I have is huge so get the minimum value of b from each group > and joining on the original dataframe is very expensive. Is there a better > way to do this? > > > Thanks, > Wei > > -- Best Regards, Ayan Guha
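
A minimal PySpark sketch of avoiding the groupBy-plus-join by using a window function instead — pick the row with the minimum b within each group in a single pass (the grouping column `grp` is an assumption, since the thread snippet is truncated):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = Window.partitionBy("grp").orderBy(F.col("b").asc())
    result = (df.withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") == 1)
                .drop("rn"))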

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread ayan guha
gt;>>>>> step, which should be another 6.2TB shuffle read. >>>>>> >>>>>> I think to Dedup, the shuffling can not be avoided. Is there anything >>>>>> I could do to stablize this process? >>>>>> >>>>>> Thanks. >>>>>> >>>>>> >>>>>> On Fri, Jan 8, 2016 at 2:04 PM, Gavin Yue <yue.yuany...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Hey, >>>>>>> >>>>>>> I got everyday's Event table and want to merge them into a single >>>>>>> Event table. But there so many duplicates among each day's data. >>>>>>> >>>>>>> I use Parquet as the data source. What I am doing now is >>>>>>> >>>>>>> EventDay1.unionAll(EventDay2).distinct().write.parquet("a new >>>>>>> parquet file"). >>>>>>> >>>>>>> Each day's Event is stored in their own Parquet file >>>>>>> >>>>>>> But it failed at the stage2 which keeps losing connection to one >>>>>>> executor. I guess this is due to the memory issue. >>>>>>> >>>>>>> Any suggestion how I do this efficiently? >>>>>>> >>>>>>> Thanks, >>>>>>> Gavin >>>>>>> >>>>>> >>>>>> >>>>> >>>> >>> >> > -- Best Regards, Ayan Guha

Re: spark rdd grouping

2015-12-01 Thread ayan guha
> PairRdd is basically constrcuted using kafka streaming low level consumer > > which have all records with same key already in same partition. Can i > group > > them together with avoid shuffle. > > > > Thanks > > > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Read from AWS s3 with out having to hard-code sensitive keys

2016-01-12 Thread ayan guha
nfiguration().set("fs.s3.awsAccessKeyId", "") >>> sc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", >>> "") >>> >>> 2. Set keys in URL, e.g.: >>> sc.textFile("s3n://@/bucket/test/testdata") >>> >>> >>> Both if which I'm reluctant to do within production code! >>> >>> >>> Cheers >>> >> >> -- Best Regards, Ayan Guha

Re: pre-install 3-party Python package on spark cluster

2016-01-12 Thread ayan guha
Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > > > > -- Best Regards, Ayan Guha

Re: Spark SQL Errors

2016-05-31 Thread ayan guha
gt; > > http://talebzadehmich.wordpress.com > > > > On 31 May 2016 at 06:31, ayan guha <guha.a...@gmail.com> wrote: > >> No there is no semicolon. >> >> This is the query: >> >> 16/05/31 14:34:29 INFO SparkExecuteStatementOperation: Running

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-21 Thread ayan guha
> could be stored in Hive yet your only access method is via a JDBC or > Thift/Rest service. Think also of compute / storage cluster > implementations. > > WRT to #2, not exactly what I meant, by exposing the data… and there are > limitations to the thift service… > > On Jun 2

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-21 Thread ayan guha
ultimately resides. There really is > a method to my madness, and if I could explain it… these questions really > would make sense. ;-) > > TIA, > > -Mike > > > ----- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Python to Scala

2016-06-18 Thread ayan guha
;>>> will cut the effort of learning scala. >>>>> >>>>> https://spark.apache.org/docs/0.9.0/python-programming-guide.html >>>>> >>>>> - Thanks, via mobile, excuse brevity. >>>>> On Jun 18, 2016 2:34 PM, "Aakash Basu" <raj2coo...@gmail.com> wrote: >>>>> >>>>>> Hi all, >>>>>> >>>>>> I've a python code, which I want to convert to Scala for using it in >>>>>> a Spark program. I'm not so well acquainted with python and learning >>>>>> scala >>>>>> now. Any Python+Scala expert here? Can someone help me out in this >>>>>> please? >>>>>> >>>>>> Thanks & Regards, >>>>>> Aakash. >>>>>> >>>>> >>> -- Best Regards, Ayan Guha

Re: Partitioning in spark

2016-06-23 Thread ayan guha
col4 then why does it shuffle everything whereas > it need to sort each partitions and then should grouping there itself. > > Bit confusing , I am using 1.5.1 > > Is it fixed in future versions. > > Thanks > -- Best Regards, Ayan Guha

Re: add multiple columns

2016-06-26 Thread ayan guha
when you have multiple i have > to loop on eache columns ? > > > > thanks > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: AM creation in yarn client mode

2016-02-09 Thread ayan guha
created on the client where the job was submitted? i.e driver and > AM on the same client? > Or > B) yarn decides where the the AM should be created? > > 2) Driver and AM run in different processes : is my assumption correct? > > Regards, > Praveen > -- Best Regards, Ayan Guha

Re: Spark Integration Patterns

2016-02-28 Thread ayan guha
g a process on a remote host >>>> to execute a shell script seems like a lot of effort What are the >>>> recommended ways to connect and query Spark from a remote client ? Thanks >>>> Thx ! >>>> -- >>>> View this message in context: Spark Integration Patterns >>>> <http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Integration-Patterns-tp26354.html> >>>> Sent from the Apache Spark User List mailing list archive >>>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com. >>>> >>> >>> > > > -- > Luciano Resende > http://people.apache.org/~lresende > http://twitter.com/lresende1975 > http://lresende.blogspot.com/ > -- Best Regards, Ayan Guha

Re: select * from mytable where column1 in (select max(column1) from mytable)

2016-02-26 Thread ayan guha
of now. See this ticket > <https://issues.apache.org/jira/browse/SPARK-4226> for more on this. > > > > [image: http://] > > Tariq, Mohammad > about.me/mti > [image: http://] > <http://about.me/mti> > > > On Fri, Feb 26, 2016 at 7:01 AM, ayan guha &l

Re: select * from mytable where column1 in (select max(column1) from mytable)

2016-02-25 Thread ayan guha
n (select max(column1) from mytable) > > Thanks > -- Best Regards, Ayan Guha

Re: Trying to join a registered Hive table as temporary with two Oracle tables registered as temporary in Spark

2016-02-14 Thread ayan guha
Why can't you use the JDBC source in the Hive context? I don't think sharing data across contexts is allowed. On 15 Feb 2016 07:22, "Mich Talebzadeh" wrote: > I am intending to get a table from Hive and register it as temporary table > in Spark. > > > > I have created contexts for

Re: Spark Error: Not enough space to cache partition rdd

2016-02-14 Thread ayan guha
Have you tried repartitioning to a larger number of partitions? Also, I would suggest increasing the number of executors and giving each a smaller amount of memory. On 15 Feb 2016 06:49, "gustavolacerdas" wrote: > I have a machine with 96GB and 24 cores. I'm trying to run a
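
A rough sketch of both suggestions (the numbers are purely illustrative and need tuning for the actual workload and cluster):

    # more, smaller partitions before caching
    rdd = rdd.repartition(400)   # hypothetical partition count

    # more, smaller executors at submit time, e.g.:
    #   spark-submit --num-executors 12 --executor-memory 8g --executor-cores 2 ...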

Re: Spark Certification

2016-02-14 Thread ayan guha
Thanks. Do we have any forum or study group for certification aspirants? I would like to join. On 15 Feb 2016 05:53, "Olivier Girardot" wrote: > It does not contain (as of yet) anything > 1.3 (for example in depth > knowledge of the Dataframe API) > but you need

Re: is Hbase Scan really need thorough Get (Hbase+solr+spark)

2016-01-19 Thread ayan guha
Value("rowkey"))); > list.add(get); > > } > > *Result[] res = table.get(list);//This is really need? because it takes > extra time to scan right?* > This piece of code i got from > http://www.programering.com/a/MTM5kDMwATI.html > > please correct if anything wrong :) > > Thanks > Beesh > > -- Best Regards, Ayan Guha

Re: Spark Streaming with Kafka DirectStream

2016-02-16 Thread ayan guha
or management of > Spark resources) ? > > Thank you > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Spark Streaming with Kafka DirectStream

2016-02-16 Thread ayan guha
to read from Kafka there > are 5 tasks writing to E/S. So I'm supposing that the task reading from > Kafka does it in // using 5 partitions and that's why there are then 5 > tasks to write to E/S. But I'm supposing ... > > On Feb 16, 2016, at 21:12, ayan guha <guha.a...@gmail.com>

Re: Stream group by

2016-02-21 Thread ayan guha
>>> t1, file1, 1, 1, 1 >>>>> t1, file1, 1, 2, 3 >>>>> t1, file2, 2, 2, 2, 2 >>>>> t2, file1, 5, 5, 5 >>>>> t2, file2, 1, 1, 2, 2 >>>>> >>>>> and i want to achieve the output like below rows which is a vertical >>>>> addition of the corresponding numbers. >>>>> >>>>> *Output* >>>>> “file1” : [ 1+1+5, 1+2+5, 1+3+5 ] >>>>> “file2” : [ 2+1, 2+1, 2+2, 2+2 ] >>>>> >>>>> I am in a spark streaming context and i am having a hard time trying >>>>> to figure out the way to group by file name. >>>>> >>>>> It seems like i will need to use something like below, i am not sure >>>>> how to get to the correct syntax. Any inputs will be helpful. >>>>> >>>>> myDStream.foreachRDD(rdd => rdd.groupBy()) >>>>> >>>>> I know how to do the vertical sum of array of given numbers, but i am >>>>> not sure how to feed that function to the group by. >>>>> >>>>> def compute_counters(counts : ArrayBuffer[List[Int]]) = { >>>>> counts.toList.transpose.map(_.sum) >>>>> } >>>>> >>>>> ~Thanks, >>>>> Vinti >>>>> >>>> >>>> >>> >> > -- Best Regards, Ayan Guha

Zeppelin Integration

2016-03-10 Thread ayan guha
is not a good choice, yet, for the use case, what are the other alternatives? appreciate any help/pointers/guidance. -- Best Regards, Ayan Guha

Re: Zeppelin Integration

2016-03-10 Thread ayan guha
ou use a select query, the output is >> automatically displayed as a chart. >> >> As RDDs are bound to the context that creates them, I don't think >> Zeppelin can use those RDDs. >> >> I don't know if notebooks can be reused within other notebooks. It would >&g

Re: Output the data to external database at particular time in spark streaming

2016-03-08 Thread ayan guha
gt; On Tue, Mar 8, 2016 at 8:50 AM, ayan guha <guha.a...@gmail.com> wrote: > >> Why not compare current time in every batch and it meets certain >> condition emit the data? >> On 9 Mar 2016 00:19, "Abhishek Anand" <abhis.anan...@gmail.com> wrote: >&

Re: Spark Thriftserver

2016-03-16 Thread ayan guha
> It's same as hive thrift server. I believe kerberos is supported. > > On Wed, Mar 16, 2016 at 10:48 AM, ayan guha <guha.a...@gmail.com> wrote: > >> so, how about implementing security? Any pointer will be helpful >> >> On Wed, Mar 16, 2016 at 1:

Re: Sqoop on Spark

2016-04-05 Thread ayan guha
What you > guys think? > > On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfra...@gmail.com> wrote: > >> Why do you want to reimplement something which is already there? >> >> On 06 Apr 2016, at 06:47, ayan guha <guha.a...@gmail.com> wrote: >> >&g

Sqoop on Spark

2016-04-05 Thread ayan guha
Hi All Asking opinion: is it possible/advisable to use spark to replace what sqoop does? Any existing project done in similar lines? -- Best Regards, Ayan Guha

Spark thrift issue 8659 (changing subject)

2016-03-23 Thread ayan guha
> > Hi All > > I found this issue listed in Spark Jira - > https://issues.apache.org/jira/browse/SPARK-8659 > > I would love to know if there are any roadmap for this? Maybe someone from > dev group can confirm? > > Thank you in advance > > Best > Ayan > >

Re: Zeppelin Integration

2016-03-23 Thread ayan guha
, 2016 at 10:32 PM, ayan guha <guha.a...@gmail.com> wrote: > Thanks guys for reply. Yes, Zeppelin with Spark is pretty compelling > choice, for single user. Any pointers for using Zeppelin for multi user > scenario? In essence, can we either (a) Use Zeppelin to connect to a long

Re: Output the data to external database at particular time in spark streaming

2016-03-08 Thread ayan guha
Why not compare the current time in every batch and, if it meets a certain condition, emit the data? On 9 Mar 2016 00:19, "Abhishek Anand" wrote: > I have a spark streaming job where I am aggregating the data by doing > reduceByKeyAndWindow with inverse function. > > I am keeping
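
A minimal PySpark Streaming sketch of that idea — check the wall clock inside foreachRDD and only write when the condition holds (the condition shown and the write_to_db helper are hypothetical):

    from datetime import datetime

    def maybe_emit(time, rdd):
        # emit only when the wall-clock condition is met, e.g. at the top of the hour
        if datetime.now().minute == 0 and not rdd.isEmpty():
            write_to_db(rdd)   # hypothetical sink function

    stream.foreachRDD(maybe_emit)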

Re: Job fails at saveAsHadoopDataset stage due to Lost Executor due to reason unknown so far

2016-03-04 Thread ayan guha
java:111) >>>>>> at java.lang.Thread.run(Thread.java:744) >>>>>> 16/02/24 11:11:47 INFO shuffle.RetryingBlockFetcher: Retrying fetch >>>>>> (1/3) for 6 outstanding blocks after 5000 ms >>>>>> 16/02/24 11:11:52 INFO client.TransportClientFactory: Found inactive >>>>>> connection to maprnode5, creating a new one. >>>>>> 16/02/24 11:12:16 WARN server.TransportChannelHandler: Exception in >>>>>> connection from maprnode5 >>>>>> java.io.IOException: Connection reset by peer >>>>>> at sun.nio.ch.FileDispatcherImpl.read0(Native Method) >>>>>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) >>>>>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) >>>>>> at sun.nio.ch.IOUtil.read(IOUtil.java:192) >>>>>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) >>>>>> at >>>>>> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313) >>>>>> at >>>>>> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) >>>>>> at >>>>>> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242) >>>>>> at >>>>>> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) >>>>>> at >>>>>> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) >>>>>> at >>>>>> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) >>>>>> at >>>>>> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) >>>>>> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) >>>>>> at >>>>>> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) >>>>>> at java.lang.Thread.run(Thread.java:744) >>>>>> 16/02/24 11:12:16 ERROR client.TransportResponseHandler: Still have 1 >>>>>> requests outstanding when connection from maprnode5 is closed >>>>>> 16/02/24 11:12:16 ERROR shuffle.OneForOneBlockFetcher: Failed while >>>>>> starting block fetches >>>>>> >>>>>> >>>>> >>>>> >>>>> >>>>> [image: What's New with Xactly] >>>>> <http://www.xactlycorp.com/email-click/> >>>>> >>>>> <https://www.nyse.com/quote/XNYS:XTLY> [image: LinkedIn] >>>>> <https://www.linkedin.com/company/xactly-corporation> [image: >>>>> Twitter] <https://twitter.com/Xactly> [image: Facebook] >>>>> <https://www.facebook.com/XactlyCorp> [image: YouTube] >>>>> <http://www.youtube.com/xactlycorporation> >>>>> >>>>> >>>> >>> >>> >>> >>> [image: What's New with Xactly] <http://www.xactlycorp.com/email-click/> >>> >>> <https://www.nyse.com/quote/XNYS:XTLY> [image: LinkedIn] >>> <https://www.linkedin.com/company/xactly-corporation> [image: Twitter] >>> <https://twitter.com/Xactly> [image: Facebook] >>> <https://www.facebook.com/XactlyCorp> [image: YouTube] >>> <http://www.youtube.com/xactlycorporation> >>> >> >> > > > > [image: What's New with Xactly] <http://www.xactlycorp.com/email-click/> > > <https://www.nyse.com/quote/XNYS:XTLY> [image: LinkedIn] > <https://www.linkedin.com/company/xactly-corporation> [image: Twitter] > <https://twitter.com/Xactly> [image: Facebook] > <https://www.facebook.com/XactlyCorp> [image: YouTube] > <http://www.youtube.com/xactlycorporation> > -- Best Regards, Ayan Guha

Re: Facing issue with floor function in spark SQL query

2016-03-04 Thread ayan guha
`user.timestamp` as > rawTimeStamp, `user.requestId` as requestId, > *floor(`user.timestamp`/72000*) as timeBucket FROM logs"); > bucketLogs.toJSON().saveAsTextFile("target_file"); > > Regards > Ashok > -- Best Regards, Ayan Guha

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread ayan guha
ut some other knowledgeable people on the list, please chime >>> in). Two, since Spark is written in Scala, it gives you an enormous >>> advantage to read sources (which are well documented and highly readable) >>> should you have to consult or learn nuances of certain API method or action >>> not covered comprehensively in the docs. And finally, there’s a long term >>> benefit in learning Scala for reasons other than Spark. For example, >>> writing other scalable and distributed applications. >>> >>> >>> Particularly, we will be using Spark Streaming. I know a couple of years >>> ago that practically forced the decision to use Scala. Is this still the >>> case? >>> >>> >>> You’ll notice that certain APIs call are not available, at least for >>> now, in Python. >>> http://spark.apache.org/docs/latest/streaming-programming-guide.html >>> >>> >>> Cheers >>> Jules >>> >>> -- >>> The Best Ideas Are Simple >>> Jules S. Damji >>> e-mail:dmat...@comcast.net >>> e-mail:jules.da...@gmail.com >>> >>> ​ > -- Best Regards, Ayan Guha

Re: How to perform reduce operation in the same order as partition indexes

2016-05-19 Thread ayan guha
You can add the index from mapPartitionsWithIndex to the output and order based on it in the merge step. On 19 May 2016 13:22, "Pulasthi Supun Wickramasinghe" wrote: > Hi Devs/All, > > I am pretty new to Spark. I have a program which does some map reduce > operations with
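
A minimal sketch of the idea: tag each partition's partial result with its partition index, then restore the original order when merging (combine_partition, which reduces one partition's elements to a partial result, is a hypothetical helper):

    def tag_with_index(idx, it):
        # emit (partition_index, partial_result) so order can be restored later
        yield (idx, combine_partition(it))

    partials = rdd.mapPartitionsWithIndex(tag_with_index).collect()
    ordered = [part for _, part in sorted(partials, key=lambda kv: kv[0])]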

Re: Tar File: On Spark

2016-05-19 Thread ayan guha
f2.txt > > tar2: > > - f1.txt > > - f2.txt > > > > (each tar file will have exact same number of files, same name) > > > > I am trying to find a way (spark or pig) to extract them to their own > folders. > > > > f1 > > - tar1_f1.txt > > - tar2_f1.txt > > f2: > >- tar1_f2.txt > >- tar1_f2.txt > > > > Any help? > > > > > > > > -- > > Best Regards, > > Ayan Guha > > >

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-23 Thread ayan guha
01.4 sec HDFS Read: > 5318569 HDFS Write: 46 SUCCESS > > Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec > > OK > > INFO : 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, > Cumulative CPU 101.4 sec > > INFO : MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec > > INFO : Ended Job = job_1463956731753_0005 > > INFO : MapReduce Jobs Launched: > > INFO : Stage-Stage-1: Map: 22 Reduce: 1 Cumulative CPU: 101.4 sec > HDFS Read: 5318569 HDFS Write: 46 SUCCESS > > INFO : Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec > > INFO : Completed executing > command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); > Time taken: 142.525 seconds > > INFO : OK > > +-++---+---+--+ > > | c0 | c1 | c2 | c3 | > > +-++---+---+--+ > > | 1 | 1 | 5.0005E7 | 2.8867513459481288E7 | > > +-++---+---+--+ > > 1 row selected (142.744 seconds) > > > > OK Hive on map-reduce engine took 142 seconds compared to 58 seconds with > Hive on Spark. So you can obviously gain pretty well by using Hive on Spark. > > > > Please also note that I did not use any vendor's build for this purpose. I > compiled Spark 1.3.1 myself. > > > > HTH > > > > > > Dr Mich Talebzadeh > > > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > > > http://talebzadehmich.wordpress.com/ > > > > -- Best Regards, Ayan Guha

Tar File: On Spark

2016-05-19 Thread ayan guha
folders. f1 - tar1_f1.txt - tar2_f1.txt f2: - tar1_f2.txt - tar1_f2.txt Any help? -- Best Regards, Ayan Guha

Pyspark with non default hive table

2016-05-10 Thread ayan guha
Hi Can we write to non default hive table using pyspark?

Re: SparkSQL with large result size

2016-05-02 Thread ayan guha
How many executors are you running? Does your partitioning scheme ensure data is distributed evenly? It is possible that your data is skewed and one of the executors is failing. Maybe you can try reducing per-executor memory and increasing the number of partitions. On 2 May 2016 14:19, "Buntu Dev"

Re: partitioner aware subtract

2016-05-09 Thread ayan guha
How about an outer join? On 9 May 2016 13:18, "Raghava Mutharaju" wrote: > Hello All, > > We have two PairRDDs (rdd1, rdd2) which are hash partitioned on key > (number of partitions are same for both the RDDs). We would like to > subtract rdd2 from rdd1. > > The subtract
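
A minimal sketch of the outer-join approach for key-value RDDs — keep the rdd1 entries whose key has no match in rdd2 (i.e. subtract-by-key semantics), which lets the join reuse the existing hash partitioner:

    diff = (rdd1.leftOuterJoin(rdd2)
                .filter(lambda kv: kv[1][1] is None)   # no matching key in rdd2
                .mapValues(lambda v: v[0]))            # restore the original rdd1 value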

Re: Executors assigned to STS and number of workers in Stand Alone Mode

2016-07-25 Thread ayan guha
>>> I'm glad you've mentioned it. >>> >>> I think Cloudera (and Hortonworks?) guys are doing a great job with >>> bringing all the features of YARN to Spark and I think Spark on YARN >>> shines features-wise. >>> >>> I'm not in a position to compare YARN vs Mesos for their resource >>> management, but Spark on Mesos is certainly lagging behind Spark on >>> YARN regarding the features Spark uses off the scheduler backends -- >>> security, data locality, queues, etc. (or I might be simply biased >>> after having spent months with Spark on YARN mostly?). >>> >>> Jacek >>> >> >> > -- Best Regards, Ayan Guha

Re: Spark Thrift Server performance

2016-07-13 Thread ayan guha
, 2016 at 1:38 AM, Michael Segel <msegel_had...@hotmail.com> wrote: > Hey, silly question? > > If you’re running a load balancer, are you trying to reuse the RDDs > between jobs? > > TIA > -Mike > > On Jul 13, 2016, at 9:08 AM, ayan guha <guha.a...@gmail.com

Re: correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread ayan guha
it >> is taking 2 hours for inserting / upserting 5ooK records in parquet format >> in some hdfs location where each location gets mapped to one partition. >> >> My spark conf specs are : >> >> yarn cluster mode. single node. >> spark.executor.memory 8g >> spark.rpc.netty.dispatcher.numThreads 2 >> >> Thanks, >> Sumit >> >> >> > -- Best Regards, Ayan Guha

Re: correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread ayan guha
me see how HBase might efficiently > tackle this classic upsert case. > > Thanks, > Sumit > > On Fri, Jul 29, 2016 at 3:22 PM, ayan guha <guha.a...@gmail.com> wrote: > >> This is a classic case compared to hadoop vs DWH implmentation. >> >> Source (Delt

is Hadoop need to be installed?

2016-07-31 Thread ayan guha
not possible anymore? [image: Inline image 1] -- Best Regards, Ayan Guha

Re: Windows - Spark 2 - Standalone - Worker not able to connect to Master

2016-08-01 Thread ayan guha
No, I confirmed the master is running via the Spark UI at localhost:8080. On 1 Aug 2016 18:22, "Nikolay Zhebet" <phpap...@gmail.com> wrote: > I think you haven't run spark master yet, or maybe port 7077 is not yours > default port for spark master. > > 2016-08-01 4:24

Re: Hive and distributed sql engine

2016-07-25 Thread ayan guha
In order to use an existing pg UDF, you may create a view in pg and expose the view to Hive. The Spark-to-database connection happens from each executor, so you must have a connection or a pool of connections per worker. Executors of the same worker can share a connection pool. Best Ayan On 25 Jul 2016
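
A minimal sketch of how per-executor connections typically look in practice — one connection (or pooled connection) per partition rather than per record (get_connection and upsert are hypothetical helpers, not from the thread):

    def write_partition(rows):
        conn = get_connection()          # hypothetical pooled/driver connection, opened on the executor
        try:
            for r in rows:
                upsert(conn, r)          # hypothetical write helper
        finally:
            conn.close()

    df.rdd.foreachPartition(write_partition)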

Re: spark sql aggregate function "Nth"

2016-07-26 Thread ayan guha
You can use rank with a window function; rank=1 is the same as calling first(). Not sure how you would randomly pick records, though, if there is no Nth record. In your example, what happens if the data has only 2 rows? On 27 Jul 2016 00:57, "Alex Nastetsky" wrote: >
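
A minimal PySpark sketch of the windowed-rank idea (the columns grp and value are assumptions; row_number is used below instead of rank so each group yields exactly one row even with ties — n = 1 reproduces first()):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    n = 2   # which row to pick per group, e.g. the 2nd
    w = Window.partitionBy("grp").orderBy("value")
    nth = (df.withColumn("rnk", F.row_number().over(w))
             .filter(F.col("rnk") == n)
             .drop("rnk"))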

Re: Java Recipes for Spark

2016-07-29 Thread ayan guha
ava recipes for Apache Spark updated. >>> It's done here: http://jgp.net/2016/07/22/spark-java-recipes/ and in >>> the GitHub repo. >>> >>> Enjoy / have a great week-end. >>> >>> jg >>> >>> >>> >> >> - >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >> >> -- Best Regards, Ayan Guha

Re: PySpark 1.6.1: 'builtin_function_or_method' object has no attribute '__code__' in Pickles

2016-07-29 Thread ayan guha
images = sc.binaryFiles("/myimages/*.jpg") > image_to_text = lambda rawdata: do_some_with_bytes(file_bytes(rawdata)) > print images.values().map(image_to_text).take(1) #this gives an error > > > What is the way to load this library? > > -- Best Regards, Ayan Guha

Re: How to filter based on a constant value

2016-07-30 Thread ayan guha
| 2015-12-15| XYZ LTD CD 4636 | 10.95| > +---+--+---+ > > Now if I want to use the var maxdate in place of "2015-12-15", how would I > do that? > > I tried lit(maxdate) etc but they are all giving me error? > > java.lang.RuntimeException: Unsupported literal type class > org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema > [2015-12-15] > > > Thanks > -- Best Regards, Ayan Guha
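
One common cause of this error is passing a whole Row to lit() instead of the scalar inside it. A minimal PySpark sketch of that fix (the DataFrame and column name transactiondate are assumptions, not from the thread):

    from pyspark.sql import functions as F

    maxdate = df.agg(F.max("transactiondate")).collect()[0][0]   # plain value, not a Row
    filtered = df.filter(F.col("transactiondate") == F.lit(maxdate))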

Re: PySpark 1.6.1: 'builtin_function_or_method' object has no attribute '__code__' in Pickles

2016-07-30 Thread ayan guha
vant to these questions. > > Thanks again. > > > > On Sat, Jul 30, 2016 at 1:42 AM, Bhaarat Sharma <bhaara...@gmail.com> > wrote: > >> Great, let me give that a shot. >> >> On Sat, Jul 30, 2016 at 1:40 AM, ayan guha <guha.a...@gmail.com> wrote:

Re: How to write contents of RDD to HDFS as separate file for each item in RDD (PySpark)

2016-07-30 Thread ayan guha
it possible to achieve what I'm after? I don't want to write files to > local file system and them put them in HDFS. Instead, I want to use the > saveAsTextFile method on the RDD directly. > > > -- Best Regards, Ayan Guha

Re: How to filter based on a constant value

2016-07-31 Thread ayan guha
ss.com > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable fo

Re: How to filter based on a constant value

2016-07-31 Thread ayan guha
erty which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > On 31 July 2016 at 10:36, ayan guha <guha.a...@gmail.com> wrote: &g

Re: How to filter based on a constant value

2016-07-31 Thread ayan guha
Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising f

Re: How to filter based on a constant value

2016-07-31 Thread ayan guha
gt; <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other proper

Re: How to filter based on a constant value

2016-07-31 Thread ayan guha
w?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, d

Re: error while running filter on dataframe

2016-07-31 Thread ayan guha
It would help to share the Spark version, environment details and a code snippet. There are many very knowledgeable people here who can then help. On 1 Aug 2016 02:15, "Tony Lane" wrote: > Can someone help me understand this error which occurs while running a > filter on a

Windows - Spark 2 - Standalone - Worker not able to connect to Master

2016-07-31 Thread ayan guha
) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThread EventExecutor.java:111) ... 1 more Am I doing something wrong? -- Best Regards, Ayan Guha

Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread ayan guha
gt;>> your responsibility) and if their optimatizations are correctly configured >>>> (min max index, bloom filter, compression etc) . >>>> >>>> If you need to ingest sensor data you may want to store it first in >>>> hbase and then batch process it in large files in Orc or parquet format. >>>> >>>> On 26 Jul 2016, at 04:09, janardhan shetty <janardhan...@gmail.com> >>>> wrote: >>>> >>>> Just wondering advantages and disadvantages to convert data into ORC or >>>> Parquet. >>>> >>>> In the documentation of Spark there are numerous examples of Parquet >>>> format. >>>> >>>> Any strong reasons to chose Parquet over ORC file format ? >>>> >>>> Also : current data compression is bzip2 >>>> >>>> >>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy >>>> This seems like biased. >>>> >>>> >>> >> > -- Best Regards, Ayan Guha

Re: Does it has a way to config limit in query on STS by default?

2016-08-02 Thread ayan guha
Zeppelin already has a param for jdbc On 2 Aug 2016 19:50, "Mich Talebzadeh" wrote: > Ok I have already set up mine > > > hive.limit.optimize.fetch.max > 5 > > Maximum number of rows allowed for a smaller subset of data for > simple LIMIT,

Re: Extracting key word from a textual column

2016-08-02 Thread ayan guha
I would stay away from transaction tables until they are fully baked. I do not see why you need to update rather than keep inserting with a timestamp and deriving the latest value on the fly while joining. But I guess it has become a religious question now :) and I am not unbiased. On 3 Aug 2016 08:51, "Mich
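
A minimal sketch of the insert-with-timestamp approach — append every change and derive the latest row per key at read time with a window (the column names key, value and load_ts are assumptions):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = Window.partitionBy("key").orderBy(F.col("load_ts").desc())
    latest = (events.withColumn("rn", F.row_number().over(w))
                    .filter(F.col("rn") == 1)
                    .drop("rn"))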
