Re: PySpark 1.6.1: 'builtin_function_or_method' object has no attribute '__code__' in Pickles

2016-07-30 Thread ayan guha
vant to these questions. > > Thanks again. > > > > On Sat, Jul 30, 2016 at 1:42 AM, Bhaarat Sharma <bhaara...@gmail.com> > wrote: > >> Great, let me give that a shot. >> >> On Sat, Jul 30, 2016 at 1:40 AM, ayan guha <guha.a...@gmail.com> wrote:

Re: PySpark 1.6.1: 'builtin_function_or_method' object has no attribute '__code__' in Pickles

2016-07-29 Thread ayan guha
images = sc.binaryFiles("/myimages/*.jpg") > image_to_text = lambda rawdata: do_some_with_bytes(file_bytes(rawdata)) > print images.values().map(image_to_text).take(1) #this gives an error > > > What is the way to load this library? > > -- Best Regards, Ayan Guha
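A minimal sketch of the usual workaround for this class of pickling error, assuming the failure comes from a C-extension object (the 'builtin_function_or_method') being captured in the driver closure; the library and names below are hypothetical, and an existing SparkContext "sc" (e.g. the pyspark shell) is assumed:

    def image_to_text(contents):
        # Import and construct the extension object on the executor, inside the
        # function, so nothing unpicklable is captured in the driver closure.
        import some_ocr_lib                      # hypothetical C-extension library
        engine = some_ocr_lib.Engine()
        return engine.process(bytes(contents))   # binaryFiles values are raw bytes

    images = sc.binaryFiles("/myimages/*.jpg")
    print(images.values().map(image_to_text).take(1))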

Re: Java Recipes for Spark

2016-07-29 Thread ayan guha
ava recipes for Apache Spark updated. >>> It's done here: http://jgp.net/2016/07/22/spark-java-recipes/ and in >>> the GitHub repo. >>> >>> Enjoy / have a great week-end. >>> >>> jg >>> >>> >>> >> >> - >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >> >> -- Best Regards, Ayan Guha

Re: correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread ayan guha
me see how HBase might efficiently > tackle this classic upsert case. > > Thanks, > Sumit > > On Fri, Jul 29, 2016 at 3:22 PM, ayan guha <guha.a...@gmail.com> wrote: > >> This is a classic case compared to hadoop vs DWH implmentation. >> >> Source (Delt

Re: correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread ayan guha
it >> is taking 2 hours for inserting / upserting 5ooK records in parquet format >> in some hdfs location where each location gets mapped to one partition. >> >> My spark conf specs are : >> >> yarn cluster mode. single node. >> spark.executor.memory 8g >> spark.rpc.netty.dispatcher.numThreads 2 >> >> Thanks, >> Sumit >> >> >> > -- Best Regards, Ayan Guha

Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread ayan guha
gt;>> your responsibility) and if their optimatizations are correctly configured >>>> (min max index, bloom filter, compression etc) . >>>> >>>> If you need to ingest sensor data you may want to store it first in >>>> hbase and then batch process it in large files in Orc or parquet format. >>>> >>>> On 26 Jul 2016, at 04:09, janardhan shetty <janardhan...@gmail.com> >>>> wrote: >>>> >>>> Just wondering advantages and disadvantages to convert data into ORC or >>>> Parquet. >>>> >>>> In the documentation of Spark there are numerous examples of Parquet >>>> format. >>>> >>>> Any strong reasons to chose Parquet over ORC file format ? >>>> >>>> Also : current data compression is bzip2 >>>> >>>> >>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy >>>> This seems like biased. >>>> >>>> >>> >> > -- Best Regards, Ayan Guha

Re: spark sql aggregate function "Nth"

2016-07-26 Thread ayan guha
You can use rank with a window function. Rank = 1 is the same as calling first(). Not sure how you would randomly pick records, though, if there is no Nth record. In your example, what happens if the data has only 2 rows? On 27 Jul 2016 00:57, "Alex Nastetsky" wrote: >
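A minimal sketch of that rank/window approach in PySpark (assuming Spark 2.x syntax; on 1.x, window functions need a HiveContext). Picking the Nth row per group just means filtering on a different rank value, and groups with fewer than N rows simply return nothing:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", 10), ("a", 20), ("a", 30), ("b", 5), ("b", 7)], ["grp", "value"])

    n = 2                                           # which row to pick per group
    w = Window.partitionBy("grp").orderBy("value")  # the ordering defines what "Nth" means

    nth_per_group = (df.withColumn("rnk", F.row_number().over(w))
                       .where(F.col("rnk") == n)
                       .drop("rnk"))
    nth_per_group.show()                            # group "b" has 2 rows, so n=3 would drop it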

Re: Executors assigned to STS and number of workers in Stand Alone Mode

2016-07-25 Thread ayan guha
>>> I'm glad you've mentioned it. >>> >>> I think Cloudera (and Hortonworks?) guys are doing a great job with >>> bringing all the features of YARN to Spark and I think Spark on YARN >>> shines features-wise. >>> >>> I'm not in a position to compare YARN vs Mesos for their resource >>> management, but Spark on Mesos is certainly lagging behind Spark on >>> YARN regarding the features Spark uses off the scheduler backends -- >>> security, data locality, queues, etc. (or I might be simply biased >>> after having spent months with Spark on YARN mostly?). >>> >>> Jacek >>> >> >> > -- Best Regards, Ayan Guha

Re: Hive and distributed sql engine

2016-07-25 Thread ayan guha
In order to use an existing pg UDF, you may create a view in pg and expose the view to Hive. The Spark-to-database connection happens from each executor, so you must have a connection or a pool of connections per worker. Executors on the same worker can share a connection pool. Best Ayan On 25 Jul 2016
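A minimal sketch of that setup from the Spark side, assuming a Postgres view (my_schema.my_view, hypothetical) that already applies the existing pg UDF, and the Postgres JDBC driver jar distributed to executors (e.g. via --jars):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    pg_view = (spark.read.format("jdbc")
               .option("url", "jdbc:postgresql://pg-host:5432/mydb")  # hypothetical host/db
               .option("dbtable", "my_schema.my_view")                # view wrapping the pg UDF
               .option("user", "spark_reader")                        # hypothetical credentials
               .option("password", "secret")
               .option("partitionColumn", "id")    # numeric column: one JDBC connection per task
               .option("lowerBound", "1")
               .option("upperBound", "1000000")
               .option("numPartitions", "4")
               .load())

    pg_view.createOrReplaceTempView("pg_view")      # now queryable alongside Hive tables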

Re: calculate time difference between consecutive rows

2016-07-21 Thread ayan guha
Please post your code and results. Lag will be null for the first record. Also, what data type are you using? Are you using cast? On 21 Jul 2016 14:28, "Divya Gehlot" wrote: > I have a dataset of time as shown below : > Time1 > 07:30:23 > 07:34:34 > 07:38:23 > 07:39:12 >
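A minimal sketch of the lag() approach, assuming Spark 2.x and a single column of HH:mm:ss strings; as noted above, lag() returns null for the first row, so the first difference is null:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("07:30:23",), ("07:34:34",), ("07:38:23",), ("07:39:12",)], ["Time1"])

    # Parse to seconds with an explicit HH:mm:ss format, which avoids the
    # null-on-cast problem the question above hints at.
    df = df.withColumn("secs", F.unix_timestamp("Time1", "HH:mm:ss"))

    w = Window.orderBy("secs")                      # single global ordering (fine for a demo)
    df.withColumn("diff_secs", F.col("secs") - F.lag("secs", 1).over(w)).show()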

Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-20 Thread ayan guha
Just as a sanity check, saving data to HBase for analytics may not be the best choice. Any specific reason for not using HDFS or Hive? On 20 Jul 2016 20:57, "Rabin Banerjee" wrote: > Hi Wei , > > You can do something like this , > > foreachPartition( (part) => {

Re: Missing Exector Logs From Yarn After Spark Failure

2016-07-19 Thread ayan guha
tting etc in Yarn >> to a very large number. But still these logs are deleted as soon as the >> Spark applications are crashed. >> >> Question: How can we retain these Spark application logs in Yarn for >> debugging when the Spark application is crashed for some reason. >> > > -- Best Regards, Ayan Guha

Re: Little idea needed

2016-07-19 Thread ayan guha
Well, this one keeps cropping up in every project, especially when Hadoop is implemented alongside an MPP. In fact, there is no reliable out-of-the-box update operation available in HDFS, Hive or Spark. Hence, one approach is what Mitch suggested: do not update. Rather, just keep all source
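A minimal sketch of that "never update, keep every version" pattern, with hypothetical key/timestamp column names and paths; the current state is resolved at read time by taking the latest row per key:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    # 1. Append each day's delta as-is; nothing is ever updated in place.
    delta = spark.read.parquet("/data/source/delta/2016-07-19")      # hypothetical path
    delta.write.mode("append").parquet("/data/target/history")

    # 2. Derive the "current" view: latest load wins per business key.
    history = spark.read.parquet("/data/target/history")
    w = Window.partitionBy("record_id").orderBy(F.col("load_ts").desc())
    current = (history.withColumn("rn", F.row_number().over(w))
                      .where("rn = 1")
                      .drop("rn"))
    current.createOrReplaceTempView("target_current")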

Re: Inode for STS

2016-07-18 Thread ayan guha
wrote: > Hi Ayan, > I seem like you mention this > https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.start.cleanup.scratchdir > Default it was set false by default. > > > > > > On Jul 13, 2016, at 5:01 PM, ayan guha <

Re: Dataframe Transformation with Inner fields in Complex Datatypes.

2016-07-17 Thread ayan guha
CaseUDF($"address.line1.buildname")), it is resulting in a new > dataframe with new column: "address.line1.buildname" appended, with > toUpperCaseUDF values from inner field buildname. > > How can I update the inner fields of the complex data types. Kindly > suggest. > > Thanks in anticipation. > > Best Regards, > Naveen Kumar. > -- Best Regards, Ayan Guha

Re: Call http request from within Spark

2016-07-15 Thread ayan guha
Can you explain what do you mean by count never stops? On 15 Jul 2016 00:53, "Amit Dutta" wrote: > Hi All, > > > I have a requirement to call a rest service url for 300k customer ids. > > Things I have tried so far is > > > custid_rdd =

Re: Spark Thrift Server performance

2016-07-13 Thread ayan guha
, 2016 at 1:38 AM, Michael Segel <msegel_had...@hotmail.com> wrote: > Hey, silly question? > > If you’re running a load balancer, are you trying to reuse the RDDs > between jobs? > > TIA > -Mike > > On Jul 13, 2016, at 9:08 AM, ayan guha <guha.a...@gmail.com

Re: Spark Thrift Server performance

2016-07-13 Thread ayan guha
ps://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > -- Best Regards, Ayan Guha

Spark 7736

2016-07-13 Thread ayan guha
Hi, I am facing the same issue reported in Spark 7736 <https://issues.apache.org/jira/browse/SPARK-7736> on Spark 1.6.0. Is there any way to reopen the Jira? Reproduction steps attached. -- Best Regards, Ayan Guha Spark 7736.docx Description: MS-Word 2007 do

Re: Inode for STS

2016-07-13 Thread ayan guha
:25 PM, Christophe Préaud < christophe.pre...@kelkoo.com> wrote: > Hi Ayan, > > I have opened a JIRA about this issues, but there are no answer so far: > SPARK-15401 <https://issues.apache.org/jira/browse/SPARK-15401> > > Regards, > Christophe. > > > On 13/0

Inode for STS

2016-07-12 Thread ayan guha
sue. What is the best way to clean up those small files periodically? -- Best Regards, Ayan Guha

Fwd: Fast database with writes per second and horizontal scaling

2016-07-11 Thread ayan guha
a doc databases is pretty neat but not fast enough > > May main concern is fast writes per second and good scaling. > > > Hive on Spark or Tez? > > How about Hbase. or anything else > > Any expert advice warmly acknowledged.. > > thanking you > > > -- Best Regards, Ayan Guha

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread ayan guha
- >>> Compile Query 0.21s >>> Prepare Plan0.13s >>> Submit Plan 0.34s >>> Start DAG 0.23s >>> Run DAG 0.71s >>> >>> --- >>> --- >>> >>> Task Execution Summary >>> >>> --- >>> --- >>> VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms) INPUT_RECORDS >>> OUTPUT_RECORDS >>> >>> --- >>> --- >>> Map 1 604.00 00 59,957,438 >>> 13 >>> Reducer 2 105.00 00 13 >>>0 >>> >>> --- >>> --- >>> >>> LLAP IO Summary >>> >>> --- >>> --- >>> VERTICES ROWGROUPS META_HIT META_MISS DATA_HIT DATA_MISS >>> ALLOCATION >>> USED TOTAL_IO >>> >>> --- >>> --- >>> Map 1 6036 01460B68.86MB >>> 491.00MB >>> 479.89MB 7.94s >>> >>> --- >>> --- >>> >>> OK >>> 0.1 >>> Time taken: 1.669 seconds, Fetched: 1 row(s) >>> hive(tpch_flat_orc_10)> >>> >>> >>> This is running against a single 16 core box & I would assume it would >>> take <1.4s to read twice as much (13 tasks is barely touching the load >>> factors). >>> >>> It would probably be a bit faster if the cache had hits, but in general >>> 14s to read a 100M rows is nearly a magnitude off where Hive 2.2.0 is. >>> >>> Cheers, >>> Gopal >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >> > -- Best Regards, Ayan Guha

Re: How to run Zeppelin and Spark Thrift Server Together

2016-07-11 Thread ayan guha
issues.apache.org/jira/browse/SPARK-2243 > > // maropu > > > On Mon, Jul 11, 2016 at 12:01 PM, ayan guha <guha.a...@gmail.com> wrote: > >> Hi >> >> Can you try using JDBC interpreter with STS? We are using Zeppelin+STS on >> YARN for few months now

Re: "client / server" config

2016-07-10 Thread ayan guha
t;j...@jgp.net> wrote: > Good for the file :) > > No it goes on... Like if it was waiting for something > > jg > > > On Jul 10, 2016, at 22:55, ayan guha <guha.a...@gmail.com> wrote: > > Is this terminating the execution or spark application still runs

Re: How to run Zeppelin and Spark Thrift Server Together

2016-07-10 Thread ayan guha
al mode If I deploy on mesos cluster what would > happened? > > Need you guys suggests some solutions for that. Thanks. > > Chanh > ----- > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: "client / server" config

2016-07-10 Thread ayan guha
.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:109) > at scala.Option.getOrElse(Option.scala:120) > > Do I have to install HADOOP on the server? - I imagine that from: > java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set. > > TIA, > > jg > > -- Best Regards, Ayan Guha

Re: Spark as sql engine on S3

2016-07-07 Thread ayan guha
> On Friday, 8 July 2016, 5:27, ayan guha <guha.a...@gmail.com> wrote: > > > Spark Thrift Server..works as jdbc server. you can connect to it from > any jdbc tool like squirrel > > On Fri, Jul 8, 2016 at 3:50 AM, Ashok Kumar <ashok34...@yahoo.com.invalid> > wrote

Re: Any ways to connect BI tool to Spark without Hive

2016-07-07 Thread ayan guha
se is there any way to share things between too? I mean > Zeppelin and Thift Server. For example: I update, insert data to a table on > Zeppelin and external tool connect through STS can get it. > > Thanks & regards, > Chanh > > On Jul 8, 2016, at 11:21 AM, ayan guha <guha.a...

Re: Spark as sql engine on S3

2016-07-07 Thread ayan guha
best way to use Spark as SQL engine to access data > on S3? > > Any info/write up will be greatly appreciated. > > Regards > -- Best Regards, Ayan Guha

Re: Any ways to connect BI tool to Spark without Hive

2016-07-07 Thread ayan guha
>> Our architecture is Spark, Alluxio, Zeppelin. Because We want to share >> what we have done in Zeppelin to business users. >> >> Is there any way to do that? >> >> Thanks. >> - >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >> >> > > -- Best Regards, Ayan Guha

Re: Spark Left outer Join issue using programmatic sql joins

2016-07-06 Thread ayan guha
gt; |20| 1006| abh| null| null| > |20| 1009| abl| null|null| > |30| 1004| abf| null|null| > |30| 1008| abk| null|null| > +--+--++--++ > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Left-outer-Join-issue-using-programmatic-sql-joins-tp27295.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: add multiple columns

2016-06-26 Thread ayan guha
when you have multiple i have > to loop on eache columns ? > > > > thanks > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Partitioning in spark

2016-06-23 Thread ayan guha
col4 then why does it shuffle everything whereas > it need to sort each partitions and then should grouping there itself. > > Bit confusing , I am using 1.5.1 > > Is it fixed in future versions. > > Thanks > -- Best Regards, Ayan Guha

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-21 Thread ayan guha
> could be stored in Hive yet your only access method is via a JDBC or > Thift/Rest service. Think also of compute / storage cluster > implementations. > > WRT to #2, not exactly what I meant, by exposing the data… and there are > limitations to the thift service… > > On Jun 2

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-21 Thread ayan guha
ultimately resides. There really is > a method to my madness, and if I could explain it… these questions really > would make sense. ;-) > > TIA, > > -Mike > > > ----- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Python to Scala

2016-06-18 Thread ayan guha
;>>> will cut the effort of learning scala. >>>>> >>>>> https://spark.apache.org/docs/0.9.0/python-programming-guide.html >>>>> >>>>> - Thanks, via mobile, excuse brevity. >>>>> On Jun 18, 2016 2:34 PM, "Aakash Basu" <raj2coo...@gmail.com> wrote: >>>>> >>>>>> Hi all, >>>>>> >>>>>> I've a python code, which I want to convert to Scala for using it in >>>>>> a Spark program. I'm not so well acquainted with python and learning >>>>>> scala >>>>>> now. Any Python+Scala expert here? Can someone help me out in this >>>>>> please? >>>>>> >>>>>> Thanks & Regards, >>>>>> Aakash. >>>>>> >>>>> >>> -- Best Regards, Ayan Guha

Re: Spark SQL Errors

2016-05-31 Thread ayan guha
gt; > > http://talebzadehmich.wordpress.com > > > > On 31 May 2016 at 06:31, ayan guha <guha.a...@gmail.com> wrote: > >> No there is no semicolon. >> >> This is the query: >> >> 16/05/31 14:34:29 INFO SparkExecuteStatementOperation: Running

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-23 Thread ayan guha
01.4 sec HDFS Read: > 5318569 HDFS Write: 46 SUCCESS > > Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec > > OK > > INFO : 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, > Cumulative CPU 101.4 sec > > INFO : MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec > > INFO : Ended Job = job_1463956731753_0005 > > INFO : MapReduce Jobs Launched: > > INFO : Stage-Stage-1: Map: 22 Reduce: 1 Cumulative CPU: 101.4 sec > HDFS Read: 5318569 HDFS Write: 46 SUCCESS > > INFO : Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec > > INFO : Completed executing > command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); > Time taken: 142.525 seconds > > INFO : OK > > +-++---+---+--+ > > | c0 | c1 | c2 | c3 | > > +-++---+---+--+ > > | 1 | 1 | 5.0005E7 | 2.8867513459481288E7 | > > +-++---+---+--+ > > 1 row selected (142.744 seconds) > > > > OK Hive on map-reduce engine took 142 seconds compared to 58 seconds with > Hive on Spark. So you can obviously gain pretty well by using Hive on Spark. > > > > Please also note that I did not use any vendor's build for this purpose. I > compiled Spark 1.3.1 myself. > > > > HTH > > > > > > Dr Mich Talebzadeh > > > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > > > http://talebzadehmich.wordpress.com/ > > > > -- Best Regards, Ayan Guha

Re: How to perform reduce operation in the same order as partition indexes

2016-05-19 Thread ayan guha
You can add the index from mapPartitionsWithIndex to the output and order by it in the merge step. On 19 May 2016 13:22, "Pulasthi Supun Wickramasinghe" wrote: > Hi Devs/All, > > I am pretty new to Spark. I have a program which does some map reduce > operations with
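A minimal PySpark sketch of that idea: each partition emits (partition_index, partial_result), and the driver merges the partials in index order (an existing SparkContext "sc" is assumed):

    rdd = sc.parallelize(range(100), 4)

    def part_result(idx, records):
        yield (idx, sum(records))                    # hypothetical per-partition reduce

    partials = rdd.mapPartitionsWithIndex(part_result).collect()
    merged = [value for _, value in sorted(partials)]  # restore partition order on the driver
    print(merged)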

Re: Tar File: On Spark

2016-05-19 Thread ayan guha
f2.txt > > tar2: > > - f1.txt > > - f2.txt > > > > (each tar file will have exact same number of files, same name) > > > > I am trying to find a way (spark or pig) to extract them to their own > folders. > > > > f1 > > - tar1_f1.txt > > - tar2_f1.txt > > f2: > >- tar1_f2.txt > >- tar1_f2.txt > > > > Any help? > > > > > > > > -- > > Best Regards, > > Ayan Guha > > >

Tar File: On Spark

2016-05-19 Thread ayan guha
folders. f1 - tar1_f1.txt - tar2_f1.txt f2: - tar1_f2.txt - tar1_f2.txt Any help? -- Best Regards, Ayan Guha

Pyspark with non default hive table

2016-05-10 Thread ayan guha
Hi, can we write to a non-default Hive table using PySpark?

Re: partitioner aware subtract

2016-05-09 Thread ayan guha
How about outer join? On 9 May 2016 13:18, "Raghava Mutharaju" wrote: > Hello All, > > We have two PairRDDs (rdd1, rdd2) which are hash partitioned on key > (number of partitions are same for both the RDDs). We would like to > subtract rdd2 from rdd1. > > The subtract

Re: SparkSQL with large result size

2016-05-02 Thread ayan guha
How many executors are you running? Does your partitioning scheme ensure data is distributed evenly? It is possible that your data is skewed and one of the executors is failing. Maybe you can try reducing per-executor memory and increasing partitions. On 2 May 2016 14:19, "Buntu Dev"

Re: Sqoop on Spark

2016-04-05 Thread ayan guha
What you > guys think? > > On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfra...@gmail.com> wrote: > >> Why do you want to reimplement something which is already there? >> >> On 06 Apr 2016, at 06:47, ayan guha <guha.a...@gmail.com> wrote: >> >&g

Sqoop on Spark

2016-04-05 Thread ayan guha
Hi All, asking for opinions: is it possible/advisable to use Spark to replace what Sqoop does? Any existing projects along similar lines? -- Best Regards, Ayan Guha

Spark thrift issue 8659 (changing subject)

2016-03-23 Thread ayan guha
> > Hi All > > I found this issue listed in Spark Jira - > https://issues.apache.org/jira/browse/SPARK-8659 > > I would love to know if there are any roadmap for this? Maybe someone from > dev group can confirm? > > Thank you in advance > > Best > Ayan > >

Re: Zeppelin Integration

2016-03-23 Thread ayan guha
, 2016 at 10:32 PM, ayan guha <guha.a...@gmail.com> wrote: > Thanks guys for reply. Yes, Zeppelin with Spark is pretty compelling > choice, for single user. Any pointers for using Zeppelin for multi user > scenario? In essence, can we either (a) Use Zeppelin to connect to a long

Re: Spark Thriftserver

2016-03-16 Thread ayan guha
> It's same as hive thrift server. I believe kerberos is supported. > > On Wed, Mar 16, 2016 at 10:48 AM, ayan guha <guha.a...@gmail.com> wrote: > >> so, how about implementing security? Any pointer will be helpful >> >> On Wed, Mar 16, 2016 at 1:

Re: Zeppelin Integration

2016-03-10 Thread ayan guha
ou use a select query, the output is >> automatically displayed as a chart. >> >> As RDDs are bound to the context that creates them, I don't think >> Zeppelin can use those RDDs. >> >> I don't know if notebooks can be reused within other notebooks. It would >&g

Zeppelin Integration

2016-03-10 Thread ayan guha
is not a good choice, yet, for the use case, what are the other alternatives? appreciate any help/pointers/guidance. -- Best Regards, Ayan Guha

Re: Output the data to external database at particular time in spark streaming

2016-03-08 Thread ayan guha
gt; On Tue, Mar 8, 2016 at 8:50 AM, ayan guha <guha.a...@gmail.com> wrote: > >> Why not compare current time in every batch and it meets certain >> condition emit the data? >> On 9 Mar 2016 00:19, "Abhishek Anand" <abhis.anan...@gmail.com> wrote: >&

Re: Output the data to external database at particular time in spark streaming

2016-03-08 Thread ayan guha
Why not compare the current time in every batch and, when it meets a certain condition, emit the data? On 9 Mar 2016 00:19, "Abhishek Anand" wrote: > I have a spark streaming job where I am aggregating the data by doing > reduceByKeyAndWindow with inverse function. > > I am keeping
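A minimal sketch of that check inside foreachRDD, assuming a DStream called aggregated_dstream (hypothetical) and a once-a-day output window as the condition:

    import datetime

    def save_if_due(rdd):
        now = datetime.datetime.now()
        # hypothetical condition: only emit between 00:00 and 00:05 each day
        if now.hour == 0 and now.minute < 5 and not rdd.isEmpty():
            rdd.saveAsTextFile("/data/daily_dump/" + now.strftime("%Y%m%d%H%M%S"))

    aggregated_dstream.foreachRDD(save_if_due)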

Re: Job fails at saveAsHadoopDataset stage due to Lost Executor due to reason unknown so far

2016-03-04 Thread ayan guha
java:111) >>>>>> at java.lang.Thread.run(Thread.java:744) >>>>>> 16/02/24 11:11:47 INFO shuffle.RetryingBlockFetcher: Retrying fetch >>>>>> (1/3) for 6 outstanding blocks after 5000 ms >>>>>> 16/02/24 11:11:52 INFO client.TransportClientFactory: Found inactive >>>>>> connection to maprnode5, creating a new one. >>>>>> 16/02/24 11:12:16 WARN server.TransportChannelHandler: Exception in >>>>>> connection from maprnode5 >>>>>> java.io.IOException: Connection reset by peer >>>>>> at sun.nio.ch.FileDispatcherImpl.read0(Native Method) >>>>>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) >>>>>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) >>>>>> at sun.nio.ch.IOUtil.read(IOUtil.java:192) >>>>>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) >>>>>> at >>>>>> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313) >>>>>> at >>>>>> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) >>>>>> at >>>>>> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242) >>>>>> at >>>>>> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) >>>>>> at >>>>>> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) >>>>>> at >>>>>> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) >>>>>> at >>>>>> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) >>>>>> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) >>>>>> at >>>>>> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) >>>>>> at java.lang.Thread.run(Thread.java:744) >>>>>> 16/02/24 11:12:16 ERROR client.TransportResponseHandler: Still have 1 >>>>>> requests outstanding when connection from maprnode5 is closed >>>>>> 16/02/24 11:12:16 ERROR shuffle.OneForOneBlockFetcher: Failed while >>>>>> starting block fetches >>>>>> >>>>>> >>>>> >>>>> >>>>> >>>>> [image: What's New with Xactly] >>>>> <http://www.xactlycorp.com/email-click/> >>>>> >>>>> <https://www.nyse.com/quote/XNYS:XTLY> [image: LinkedIn] >>>>> <https://www.linkedin.com/company/xactly-corporation> [image: >>>>> Twitter] <https://twitter.com/Xactly> [image: Facebook] >>>>> <https://www.facebook.com/XactlyCorp> [image: YouTube] >>>>> <http://www.youtube.com/xactlycorporation> >>>>> >>>>> >>>> >>> >>> >>> >>> [image: What's New with Xactly] <http://www.xactlycorp.com/email-click/> >>> >>> <https://www.nyse.com/quote/XNYS:XTLY> [image: LinkedIn] >>> <https://www.linkedin.com/company/xactly-corporation> [image: Twitter] >>> <https://twitter.com/Xactly> [image: Facebook] >>> <https://www.facebook.com/XactlyCorp> [image: YouTube] >>> <http://www.youtube.com/xactlycorporation> >>> >> >> > > > > [image: What's New with Xactly] <http://www.xactlycorp.com/email-click/> > > <https://www.nyse.com/quote/XNYS:XTLY> [image: LinkedIn] > <https://www.linkedin.com/company/xactly-corporation> [image: Twitter] > <https://twitter.com/Xactly> [image: Facebook] > <https://www.facebook.com/XactlyCorp> [image: YouTube] > <http://www.youtube.com/xactlycorporation> > -- Best Regards, Ayan Guha

Re: Facing issue with floor function in spark SQL query

2016-03-04 Thread ayan guha
`user.timestamp` as > rawTimeStamp, `user.requestId` as requestId, > *floor(`user.timestamp`/72000*) as timeBucket FROM logs"); > bucketLogs.toJSON().saveAsTextFile("target_file"); > > Regards > Ashok > -- Best Regards, Ayan Guha

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread ayan guha
ut some other knowledgeable people on the list, please chime >>> in). Two, since Spark is written in Scala, it gives you an enormous >>> advantage to read sources (which are well documented and highly readable) >>> should you have to consult or learn nuances of certain API method or action >>> not covered comprehensively in the docs. And finally, there’s a long term >>> benefit in learning Scala for reasons other than Spark. For example, >>> writing other scalable and distributed applications. >>> >>> >>> Particularly, we will be using Spark Streaming. I know a couple of years >>> ago that practically forced the decision to use Scala. Is this still the >>> case? >>> >>> >>> You’ll notice that certain APIs call are not available, at least for >>> now, in Python. >>> http://spark.apache.org/docs/latest/streaming-programming-guide.html >>> >>> >>> Cheers >>> Jules >>> >>> -- >>> The Best Ideas Are Simple >>> Jules S. Damji >>> e-mail:dmat...@comcast.net >>> e-mail:jules.da...@gmail.com >>> >>> ​ > -- Best Regards, Ayan Guha

Re: Spark Integration Patterns

2016-02-28 Thread ayan guha
g a process on a remote host >>>> to execute a shell script seems like a lot of effort What are the >>>> recommended ways to connect and query Spark from a remote client ? Thanks >>>> Thx ! >>>> -- >>>> View this message in context: Spark Integration Patterns >>>> <http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Integration-Patterns-tp26354.html> >>>> Sent from the Apache Spark User List mailing list archive >>>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com. >>>> >>> >>> > > > -- > Luciano Resende > http://people.apache.org/~lresende > http://twitter.com/lresende1975 > http://lresende.blogspot.com/ > -- Best Regards, Ayan Guha

Re: select * from mytable where column1 in (select max(column1) from mytable)

2016-02-26 Thread ayan guha
of now. See this ticket > <https://issues.apache.org/jira/browse/SPARK-4226> for more on this. > > > > [image: http://] > > Tariq, Mohammad > about.me/mti > [image: http://] > <http://about.me/mti> > > > On Fri, Feb 26, 2016 at 7:01 AM, ayan guha &l

Re: select * from mytable where column1 in (select max(column1) from mytable)

2016-02-25 Thread ayan guha
n (select max(column1) from mytable) > > Thanks > -- Best Regards, Ayan Guha

Re: Stream group by

2016-02-21 Thread ayan guha
>>> t1, file1, 1, 1, 1 >>>>> t1, file1, 1, 2, 3 >>>>> t1, file2, 2, 2, 2, 2 >>>>> t2, file1, 5, 5, 5 >>>>> t2, file2, 1, 1, 2, 2 >>>>> >>>>> and i want to achieve the output like below rows which is a vertical >>>>> addition of the corresponding numbers. >>>>> >>>>> *Output* >>>>> “file1” : [ 1+1+5, 1+2+5, 1+3+5 ] >>>>> “file2” : [ 2+1, 2+1, 2+2, 2+2 ] >>>>> >>>>> I am in a spark streaming context and i am having a hard time trying >>>>> to figure out the way to group by file name. >>>>> >>>>> It seems like i will need to use something like below, i am not sure >>>>> how to get to the correct syntax. Any inputs will be helpful. >>>>> >>>>> myDStream.foreachRDD(rdd => rdd.groupBy()) >>>>> >>>>> I know how to do the vertical sum of array of given numbers, but i am >>>>> not sure how to feed that function to the group by. >>>>> >>>>> def compute_counters(counts : ArrayBuffer[List[Int]]) = { >>>>> counts.toList.transpose.map(_.sum) >>>>> } >>>>> >>>>> ~Thanks, >>>>> Vinti >>>>> >>>> >>>> >>> >> > -- Best Regards, Ayan Guha

Re: Spark Streaming with Kafka DirectStream

2016-02-16 Thread ayan guha
to read from Kafka there > are 5 tasks writing to E/S. So I'm supposing that the task reading from > Kafka does it in // using 5 partitions and that's why there are then 5 > tasks to write to E/S. But I'm supposing ... > > On Feb 16, 2016, at 21:12, ayan guha <guha.a...@gmail.com>

Re: Spark Streaming with Kafka DirectStream

2016-02-16 Thread ayan guha
or management of > Spark resources) ? > > Thank you > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Trying to join a registered Hive table as temporary with two Oracle tables registered as temporary in Spark

2016-02-14 Thread ayan guha
Why can't you use the JDBC data source in the hive context? I don't think sharing data across contexts is allowed. On 15 Feb 2016 07:22, "Mich Talebzadeh" wrote: > I am intending to get a table from Hive and register it as temporary table > in Spark. > > > > I have created contexts for
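A minimal sketch of what that looks like on Spark 1.x, where a single HiveContext both sees the Hive table and loads the Oracle tables through the JDBC data source (table names and connection details are hypothetical):

    from pyspark.sql import HiveContext

    hc = HiveContext(sc)                             # assumes an existing SparkContext "sc"

    channels = (hc.read.format("jdbc")
                .option("url", "jdbc:oracle:thin:@//ora-host:1521/ORCL")
                .option("dbtable", "SCOTT.CHANNELS")
                .option("user", "scott")
                .option("password", "tiger")
                .load())
    channels.registerTempTable("channels_ora")       # temp table in the same context

    joined = hc.sql("""
        SELECT h.*, o.channel_desc
        FROM   my_hive_table h
        JOIN   channels_ora  o ON h.channel_id = o.channel_id
    """)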

Re: Spark Error: Not enough space to cache partition rdd

2016-02-14 Thread ayan guha
Have you tried repartitioning to a larger number of partitions? Also, I would suggest increasing the number of executors and giving them a smaller amount of memory each. On 15 Feb 2016 06:49, "gustavolacerdas" wrote: > I have a machine with 96GB and 24 cores. I'm trying to run a

Re: Spark Certification

2016-02-14 Thread ayan guha
Thanks. Do we have any forum or study group for certification aspirants? I would like to join. On 15 Feb 2016 05:53, "Olivier Girardot" wrote: > It does not contain (as of yet) anything > 1.3 (for example in depth > knowledge of the Dataframe API) > but you need

Re: AM creation in yarn client mode

2016-02-09 Thread ayan guha
created on the client where the job was submitted? i.e driver and > AM on the same client? > Or > B) yarn decides where the the AM should be created? > > 2) Driver and AM run in different processes : is my assumption correct? > > Regards, > Praveen > -- Best Regards, Ayan Guha

Re: is Hbase Scan really need thorough Get (Hbase+solr+spark)

2016-01-19 Thread ayan guha
Value("rowkey"))); > list.add(get); > > } > > *Result[] res = table.get(list);//This is really need? because it takes > extra time to scan right?* > This piece of code i got from > http://www.programering.com/a/MTM5kDMwATI.html > > please correct if anything wrong :) > > Thanks > Beesh > > -- Best Regards, Ayan Guha

Re: Read from AWS s3 with out having to hard-code sensitive keys

2016-01-12 Thread ayan guha
nfiguration().set("fs.s3.awsAccessKeyId", "") >>> sc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", >>> "") >>> >>> 2. Set keys in URL, e.g.: >>> sc.textFile("s3n://@/bucket/test/testdata") >>> >>> >>> Both if which I'm reluctant to do within production code! >>> >>> >>> Cheers >>> >> >> -- Best Regards, Ayan Guha

Re: pre-install 3-party Python package on spark cluster

2016-01-12 Thread ayan guha
Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > > > > -- Best Regards, Ayan Guha

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread ayan guha
gt;>>>>> step, which should be another 6.2TB shuffle read. >>>>>> >>>>>> I think to Dedup, the shuffling can not be avoided. Is there anything >>>>>> I could do to stablize this process? >>>>>> >>>>>> Thanks. >>>>>> >>>>>> >>>>>> On Fri, Jan 8, 2016 at 2:04 PM, Gavin Yue <yue.yuany...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Hey, >>>>>>> >>>>>>> I got everyday's Event table and want to merge them into a single >>>>>>> Event table. But there so many duplicates among each day's data. >>>>>>> >>>>>>> I use Parquet as the data source. What I am doing now is >>>>>>> >>>>>>> EventDay1.unionAll(EventDay2).distinct().write.parquet("a new >>>>>>> parquet file"). >>>>>>> >>>>>>> Each day's Event is stored in their own Parquet file >>>>>>> >>>>>>> But it failed at the stage2 which keeps losing connection to one >>>>>>> executor. I guess this is due to the memory issue. >>>>>>> >>>>>>> Any suggestion how I do this efficiently? >>>>>>> >>>>>>> Thanks, >>>>>>> Gavin >>>>>>> >>>>>> >>>>>> >>>>> >>>> >>> >> > -- Best Regards, Ayan Guha

Re: pyspark dataframe: row with a minimum value of a column for each group

2016-01-05 Thread ayan guha
frame: > > a | b | c > -- > 1 | 1 | 1 > 2 | 1 | 4 > 3 | 1 | 7 > -- > > The dataframe I have is huge so get the minimum value of b from each group > and joining on the original dataframe is very expensive. Is there a better > way to do this? > > > Thanks, > Wei > > -- Best Regards, Ayan Guha
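One way to get the row with the minimum b per group without a separate aggregation and an expensive join back is a window function; a minimal sketch (assumes window support, i.e. 1.4+ with a HiveContext or any 2.x session):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 1, 1), (1, 2, 9), (2, 1, 4), (2, 4, 8), (3, 1, 7)], ["a", "b", "c"])

    w = Window.partitionBy("a").orderBy("b")         # smallest b first within each a
    per_group_min = (df.withColumn("rn", F.row_number().over(w))
                       .where("rn = 1")
                       .drop("rn"))
    per_group_min.show()                             # one full row per group, minimum b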

Re: copy/mv hdfs file to another directory by spark program

2016-01-04 Thread ayan guha
name would keep unchanged. > Just need finish it in spark program, but not hdfs commands. > Is there any codes, it seems not to be done by searching spark doc ... > > Thanks in advance! > -- Best Regards, Ayan Guha

Re: Merge rows into csv

2015-12-08 Thread ayan guha
-- > ID STATE > - > 1 TX > 1NY > 1FL > 2CA > 2OH > - > > This is the required output: > - > IDCSV_STATE > - > 1 TX,NY,FL > 2 CA,OH > - > -- Best Regards, Ayan Guha
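One way to produce that output is collect_list plus concat_ws; a minimal sketch, assuming a version where both are available as DataFrame functions (the same result can be built with an RDD reduceByKey on older releases):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, "TX"), (1, "NY"), (1, "FL"), (2, "CA"), (2, "OH")], ["ID", "STATE"])

    # Note: collect_list does not guarantee element order within the group.
    result = (df.groupBy("ID")
                .agg(F.concat_ws(",", F.collect_list("STATE")).alias("CSV_STATE")))
    result.show()
    # +---+---------+
    # | ID|CSV_STATE|
    # +---+---------+
    # |  1| TX,NY,FL|
    # |  2|    CA,OH|
    # +---+---------+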

Re: how create hbase connect?

2015-12-07 Thread ayan guha
base on Rdd? > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

RE: How to create dataframe from SQL Server SQL query

2015-12-07 Thread ayan guha
One more thing I feel would help maintainability is to create a DB view and then use the view in Spark. This avoids burying complicated SQL queries within application code. On 8 Dec 2015 05:55, "Wang, Ningjun (LNG-NPV)" wrote: > This is a very helpful article.
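A minimal sketch of what the application side then looks like, with the complicated SELECT hidden behind a SQL Server view created once outside Spark (names and connection details are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    open_orders = (spark.read.format("jdbc")
                   .option("url", "jdbc:sqlserver://sql-host:1433;databaseName=sales")
                   .option("dbtable", "dbo.vw_open_orders")   # view wrapping the complex query
                   .option("user", "spark_reader")
                   .option("password", "secret")
                   .load())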

Re: Experiences about NoSQL databases with Spark

2015-12-06 Thread ayan guha
age in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/Experiences-about-NoSQL-databases-with-Spark-tp25462p25594.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> ----- >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> > -- Best Regards, Ayan Guha

Re: sparkavro for PySpark 1.3

2015-12-05 Thread ayan guha
Loader.loadClass(ClassLoader.java:425) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > ... 26 more > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/sparkavro-for-PySpark-1-3-tp25561p25574.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: How to get the list of available Transformations and actions for a RDD in Spark-Shell

2015-12-04 Thread ayan guha
, I am using spark 1.3.0 with CDH 5.4 > > [image: Inline image 1] > > > > Thanks > Gokul > > -- Best Regards, Ayan Guha

Re: Low Latency SQL query

2015-12-01 Thread ayan guha
You can try query pushdown by embedding the query when creating the RDD. On 2 Dec 2015 12:32, "Fengdong Yu" wrote: > It depends on many situations: > > 1) what’s your data format? csv(text) or ORC/parquet? > 2) Did you have Data warehouse to summary/cluster your
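A minimal sketch of that pushdown via the JDBC data source, which accepts a derived table as "dbtable" so the database runs the query and only the summarised rows reach Spark (connection details are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    pushdown_query = """(SELECT region, SUM(amount) AS total
                         FROM sales
                         WHERE sale_date >= DATE '2015-01-01'
                         GROUP BY region) AS sales_summary"""

    summary = (spark.read.format("jdbc")
               .option("url", "jdbc:postgresql://db-host:5432/warehouse")
               .option("dbtable", pushdown_query)    # the database executes the subquery
               .option("user", "reader")
               .option("password", "secret")
               .load())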

Re: spark rdd grouping

2015-12-01 Thread ayan guha
> PairRdd is basically constrcuted using kafka streaming low level consumer > > which have all records with same key already in same partition. Can i > group > > them together with avoid shuffle. > > > > Thanks > > > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: How can you sort wordcounts by counts in stateful_network_wordcount.py example

2015-11-11 Thread ayan guha
2* > Website: www.ambodi.com > Twitter: @_ambodi <https://twitter.com/_ambodi> > -- Best Regards, Ayan Guha

Re: How to analyze weather data in Spark?

2015-11-08 Thread ayan guha
List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Assign unique link ID

2015-10-31 Thread ayan guha
ng below line - > *var src_link = src_join.as > <http://src_join.as>("SJ").withColumn("LINK_ID", > linkIDUDF(src_join("S1.RECORD_ID"),src("S2.RECORD_ID")));* > Then in further lines I'm not able to refer to "s1" columns from > "src_link" like - > *var src_link_s1 = src_link.as > <http://src_link.as>("SL").select($"S1.RECORD_ID");* > > Please guide me. > > Regards, > Sarath. > -- Best Regards, Ayan Guha

Re: Pivot Data in Spark and Scala

2015-10-31 Thread ayan guha
e transmission exempte de tout > virus, l'expéditeur ne donne aucune garantie à cet égard et sa > responsabilité ne saurait être recherchée pour tout dommage résultant d'un > virus transmis. > > This e-mail and the documents attached are confidential and intended > solely for the addressee; it may also be privileged. If you receive this > e-mail in error, please notify the sender immediately and destroy it. As > its integrity cannot be secured on the Internet, the Worldline liability > cannot be triggered for the message content. Although the sender endeavours > to maintain a computer virus-free network, the sender does not warrant that > this transmission is virus-free and will not be liable for any damages > resulting from any virus transmitted. > > > -- Best Regards, Ayan Guha

Re: Programatically create RDDs based on input

2015-10-31 Thread ayan guha
gt; Hi > > I need the ability to be able to create RDDs programatically inside my > program (e.g. based on varaible number of input files). > > Can this be done? > > I need this as I want to run the following statement inside an iteration: > > JavaRDD rdd1 = jsc.textFile(&quo

Re: Assign unique link ID

2015-10-31 Thread ayan guha
ow to tackle this? > > Regards, > Sarath. > > On Sat, Oct 31, 2015 at 4:37 PM, ayan guha <guha.a...@gmail.com> wrote: > >> Can this be a solution? >> >> 1. Write a function which will take a string and convert to md5 hash >> 2. From your base table, ge
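A minimal sketch of step 1 of that suggestion: an md5-based UDF applied to the record keys, so the same input always yields the same LINK_ID (recent versions also ship a built-in md5() column function). The column names follow the thread; the helper itself is hypothetical:

    import hashlib
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    def md5_id(s):
        return hashlib.md5(s.encode("utf-8")).hexdigest() if s is not None else None

    md5_udf = F.udf(md5_id, StringType())

    # hypothetical usage, following the thread's src_join dataframe:
    src_link = src_join.withColumn(
        "LINK_ID",
        md5_udf(F.concat_ws("|", F.col("S1.RECORD_ID"), F.col("S2.RECORD_ID"))))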

Re: Programatically create RDDs based on input

2015-10-31 Thread ayan guha
> JavaRDD jRDD[] = new JavaRDD[3]; > > //Error: Cannot create a generic array of JavaRDD > > Thanks > Amit > > > > On Sat, Oct 31, 2015 at 5:46 PM, ayan guha <guha.a...@gmail.com> wrote: > >> Corrected a typo... >> >> # In Driver >> fileList=

Re: Using Hadoop Custom Input format in Spark

2015-10-27 Thread ayan guha
Mind sharing the error you are getting? On 28 Oct 2015 03:53, "Balachandar R.A." wrote: > Hello, > > > I have developed a hadoop based solution that process a binary file. This > uses classic hadoop MR technique. The binary file is about 10GB and divided > into 73 HDFS

Re: spark multi tenancy

2015-10-07 Thread ayan guha
Can queues also be used to separate workloads? On 7 Oct 2015 20:34, "Steve Loughran" wrote: > > > On 7 Oct 2015, at 09:26, Dominik Fries > wrote: > > > > Hello Folks, > > > > We want to deploy several spark projects and want to use a unique

Re: Does feature parity exist between Spark and PySpark

2015-10-06 Thread ayan guha
il: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: HDFS small file generation problem

2015-09-27 Thread ayan guha
y) by adding on the fly my > event ? > > Tks a lot > Nicolas > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Py4j issue with Python Kafka Module

2015-09-23 Thread ayan guha
..@gmail.com> > wrote: > >> I think it is something related to class loader, the behavior is >> different for classpath and --jars. If you want to know the details I think >> you'd better dig out some source code. >> >> Thanks >> Jerry >> &

Re: Join over many small files

2015-09-23 Thread ayan guha
ficient query over such dataframe? > > > > Any advice will be appreciated. > > > > Best regards, > > Lucas > > > > == > Please access the attached hyperlink for an important electronic > communications disclaimer: > http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html > > == > -- Best Regards, Ayan Guha

Re: Py4j issue with Python Kafka Module

2015-09-22 Thread ayan guha
gContext, class java.util.HashMap, class >> java.util.HashSet, >> class java.util.HashMap]) does not exist >> at >> py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) >> >> at >> py4j.reflection.Reflection

Py4j issue with Python Kafka Module

2015-09-22 Thread ayan guha
py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Unknown Source) >>> Am I doing something wrong? -- Best Regards, Ayan Guha

Re: Spark Ingestion into Relational DB

2015-09-21 Thread ayan guha
- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Relational Log Data

2015-09-15 Thread ayan guha
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards, Ayan Guha

Re: Spark Streaming Suggestion

2015-09-15 Thread ayan guha
I think you need to make up your mind about storm vs spark. Using both in this context does not make much sense to me. On 15 Sep 2015 22:54, "David Morales" wrote: > Hi there, > > This is exactly our goal in Stratio Sparkta, a real-time aggregation > engine fully developed

Re: Directly reading data from S3 to EC2 with PySpark

2015-09-15 Thread ayan guha
Also, you can set the Hadoop conf through the jsc.hadoopConf property. Do a dir(sc) to see the exact property name. On 15 Sep 2015 22:43, "Gourav Sengupta" wrote: > Hi, > > If you start your EC2 nodes with correct roles (default in most cases > depending on your needs) you should
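A minimal PySpark sketch of that, pulling the keys from the environment rather than hard-coding them; the Hadoop configuration hangs off the wrapped Java context (dir(sc) / sc._jsc shows what your version exposes), and the bucket/path below is hypothetical:

    import os

    hadoop_conf = sc._jsc.hadoopConfiguration()      # assumes an existing SparkContext "sc"
    hadoop_conf.set("fs.s3n.awsAccessKeyId", os.environ["AWS_ACCESS_KEY_ID"])
    hadoop_conf.set("fs.s3n.awsSecretAccessKey", os.environ["AWS_SECRET_ACCESS_KEY"])

    rdd = sc.textFile("s3n://my-bucket/test/testdata")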
