Can queues also be used to separate workloads?
On 7 Oct 2015 20:34, "Steve Loughran" wrote:
>
> > On 7 Oct 2015, at 09:26, Dominik Fries
> wrote:
> >
> > Hello Folks,
> >
> > We want to deploy several spark projects and want to use a unique
il: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
Best Regards,
Ayan Guha
Do you have a benchmark showing that running these two statements as-is will
be slower than what you suggest?
On 9 Jul 2015 01:06, Brandon White bwwintheho...@gmail.com wrote:
The point of running them in parallel would be faster creation of the
tables. Has anybody been able to efficiently
Can you please post result of show()?
On 10 Jul 2015 01:00, Yana Kadiyska yana.kadiy...@gmail.com wrote:
Hi folks, I just re-wrote a query from using UNION ALL to use with
rollup and I'm seeing some unexpected behavior. I'll open a JIRA if needed
but wanted to check if this is user error. Here
).
Is there any scaling guideline for deciding which technology is best, SQL
or Spark?
On Thu, Jul 9, 2015 at 9:40 AM, ayan guha guha.a...@gmail.com wrote:
It depends on workload. How much data you would want to process?
On 9 Jul 2015 22:28, vinod kumar vinodsachin...@gmail.com wrote:
Hi Everyone
ordering within
batches*. But I wonder whether anything has changed from older Spark
versions to Spark 1.4 in this context.
Any comments please!
--
Thanks Regards,
Anshu Shukla
--
Best Regards,
Ayan Guha
SSH by default should be on port 22. 7456 is the port where the master is
listening, so any Spark app should be able to connect to the master using
that port.
On 11 Jul 2015 13:50, ashishdutt ashish.du...@gmail.com wrote:
Hello all,
In my lab a colleague installed and configured spark 1.3.0 on a 4
--
Best Regards,
Ayan Guha
Wang wbi...@gmail.com wrote:
I'm writing a streaming application and want to use spark-submit to submit
it to a YARN cluster. I'd like to submit it in a client node and exit
spark-submit after the application is running. Is it possible?
--
Best Regards,
Ayan Guha
to the spark history server.
When I run spark-shell with the master ip:port, I get the following output.
How can I verify that the worker is connected to the master?
Thanks,
Ashish
--
Best Regards,
Ayan Guha
it using
sqlContext.udf.register, but when I restarted the service the UDF was not
available.
I've heard that Hive UDFs are permanently stored in Hive (please correct
me if I am wrong).
Thanks,
Vinod
--
Best Regards,
Ayan Guha
are the databases currently supported by Spark JDBC relation provider?
rgds
--
Niranda
@n1r44 https://twitter.com/N1R44
https://pythagoreanscript.wordpress.com/
--
Best Regards,
Ayan Guha
It depends on workload. How much data you would want to process?
On 9 Jul 2015 22:28, vinod kumar vinodsachin...@gmail.com wrote:
Hi Everyone,
I am new to spark.
I am using SQL to handle data in my application, and I am now thinking of
moving to Spark.
Is data processing speed
a 'describe table' from the Spark SQL CLI, it seems to
try reading all records in the table (which takes a really long time for a
big table) instead of just giving me the metadata of the table. I would
appreciate it if someone could give me some pointers, thanks!
--
Best Regards,
Ayan Guha
)
at
org.apache.spark.sql.hive.HiveGenericUdtf.function(hiveUdfs.scala:274)
Does spark support this Hive function posexplode? If not, how to patch
it to support this? I am on Spark 1.3.1
Thanks,
Jeff Li
--
Best Regards,
Ayan Guha
>>> implements similar logic in your Pig UDF.
>>>
>>> Both approaches look similar.
>>>
>>> Personally, I would go with the Spark solution; it will be slightly faster,
>>> and easier if you already have a Spark cluster set up on top of your hadoop
>
elevancy scores.
>
>
> You can also use Spark and Pig there. However, I am not sure if Spark is
> suitable for these one-row lookups. The same holds for Pig.
>
>
> On Wed, 2 Sep 2015 at 23:53, ayan guha <guha.a...@gmail.com> wrote:
>
> Hello group
>
> I am t
Hello group
I am trying to use Pig or Spark to achieve the following:
1. Write a batch process which will read from a file.
2. Look up HBase to see if the record exists. If so, compare the incoming
values with HBase and update the fields which do not match; else create a
new record.
My
; println("Count is"+count)
>> println("First is"+firstElement)
>>
>> Now, rdd2.count launches job0 with 1 task and rdd2.first launches job1
>> with 1 task. Here in job2, when calculating rdd.first, is the entire
>> lineage computed again or else as job0 already computes rdd2, is it reused
>> ???
>>
>> Thanks,
>> Padma Ch
>>
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
--
Best Regards,
Ayan Guha
>
>
--
Best Regards,
Ayan Guha
I think you need to make up your mind about storm vs spark. Using both in
this context does not make much sense to me.
On 15 Sep 2015 22:54, "David Morales" wrote:
> Hi there,
>
> This is exactly our goal in Stratio Sparkta, a real-time aggregation
> engine fully developed
Also, you can set the Hadoop conf through the jsc.hadoopConf property. Do
a dir(sc) to see the exact property name.
On 15 Sep 2015 22:43, "Gourav Sengupta" wrote:
> Hi,
>
> If you start your EC2 nodes with correct roles (default in most cases
> depending on your needs) you should
y) by adding on the fly my
> event ?
>
> Tks a lot
> Nicolas
>
>
>
--
Best Regards,
Ayan Guha
..@gmail.com>
> wrote:
>
>> I think it is something related to class loader, the behavior is
>> different for classpath and --jars. If you want to know the details I think
>> you'd better dig out some source code.
>>
>> Thanks
>> Jerry
>>
&
>
>
--
Best Regards,
Ayan Guha
gContext, class java.util.HashMap, class
>> java.util.HashSet,
>> class java.util.HashMap]) does not exist
>> at
>> py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
>>
>> at
>> py4j.reflection.Reflection
py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)
>>>
Am I doing something wrong?
--
Best Regards,
Ayan Guha
ficient query over such dataframe?
>
>
>
> Any advice will be appreciated.
>
>
>
> Best regards,
>
> Lucas
>
>
>
> ==
> Please access the attached hyperlink for an important electronic
> communications disclaimer:
> http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html
>
> ==
>
--
Best Regards,
Ayan Guha
, I am using spark 1.3.0 with CDH 5.4
>
> [image: Inline image 1]
>
>
>
> Thanks
> Gokul
>
>
--
Best Regards,
Ayan Guha
age in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Experiences-about-NoSQL-databases-with-Spark-tp25462p25594.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>>
>
--
Best Regards,
Ayan Guha
base on Rdd?
>
>
>
--
Best Regards,
Ayan Guha
--
> ID  STATE
> -----
> 1   TX
> 1   NY
> 1   FL
> 2   CA
> 2   OH
> -----
>
> This is the required output:
> -----
> ID  CSV_STATE
> -----
> 1   TX,NY,FL
> 2   CA,OH
> -----
>
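A plain-Python sketch of the grouping logic asked about above (illustrative only; in Spark SQL this would typically be done with collect_list plus concat_ws, and the function name here is my own):

```python
def csv_states(rows):
    """rows: iterable of (id, state) pairs -> {id: 'S1,S2,...'}."""
    groups = {}
    for rec_id, state in rows:
        # collect states per id, preserving their input order
        groups.setdefault(rec_id, []).append(state)
    # join each group's states into a single CSV string
    return {k: ",".join(v) for k, v in groups.items()}

rows = [(1, "TX"), (1, "NY"), (1, "FL"), (2, "CA"), (2, "OH")]
print(csv_states(rows))  # {1: 'TX,NY,FL', 2: 'CA,OH'}
```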
--
Best Regards,
Ayan Guha
Loader.loadClass(ClassLoader.java:425)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 26 more
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/sparkavro-for-PySpark-1-3-tp25561p25574.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>
--
Best Regards,
Ayan Guha
You can try query pushdown by embedding the query when creating the RDD.
On 2 Dec 2015 12:32, "Fengdong Yu" wrote:
> It depends on many situations:
>
> 1) what’s your data format? csv(text) or ORC/parquet?
> 2) Did you have Data warehouse to summary/cluster your
One more thing I feel would help maintainability: create a DB view and
then use the view in Spark. This avoids burying complicated SQL queries
within application code.
On 8 Dec 2015 05:55, "Wang, Ningjun (LNG-NPV)"
wrote:
> This is a very helpful article.
name would remain unchanged.
> It just needs to be done in a Spark program, not with HDFS commands.
> Is there any code for this? It does not seem findable by searching the Spark docs ...
>
> Thanks in advance!
>
--
Best Regards,
Ayan Guha
frame:
>
> a | b | c
> --
> 1 | 1 | 1
> 2 | 1 | 4
> 3 | 1 | 7
> --
>
> The dataframe I have is huge, so getting the minimum value of b from each
> group and joining it back on the original dataframe is very expensive. Is
> there a better way to do this?
>
>
> Thanks,
> Wei
>
>
--
Best Regards,
Ayan Guha
gt;>>>>> step, which should be another 6.2TB shuffle read.
>>>>>>
>>>>>> I think for dedup the shuffling cannot be avoided. Is there anything
>>>>>> I could do to stabilize this process?
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 8, 2016 at 2:04 PM, Gavin Yue <yue.yuany...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey,
>>>>>>>
>>>>>>> I got everyday's Event table and want to merge them into a single
>>>>>>> Event table. But there so many duplicates among each day's data.
>>>>>>>
>>>>>>> I use Parquet as the data source. What I am doing now is
>>>>>>>
>>>>>>> EventDay1.unionAll(EventDay2).distinct().write.parquet("a new
>>>>>>> parquet file").
>>>>>>>
>>>>>>> Each day's Event is stored in their own Parquet file
>>>>>>>
>>>>>>> But it failed at the stage2 which keeps losing connection to one
>>>>>>> executor. I guess this is due to the memory issue.
>>>>>>>
>>>>>>> Any suggestion how I do this efficiently?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Gavin
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
--
Best Regards,
Ayan Guha
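The union-then-distinct dedup discussed in the thread above, sketched in plain Python (a toy model of EventDay1.unionAll(EventDay2).distinct(), not the Parquet pipeline itself; the sample records are made up):

```python
# Each day's events modeled as a list of hashable records
day1 = [("u1", "click"), ("u2", "view"), ("u1", "click")]
day2 = [("u1", "click"), ("u3", "view")]

# Union both days, then keep only distinct records
# (dict.fromkeys deduplicates while preserving first-seen order)
merged = list(dict.fromkeys(day1 + day2))
print(merged)  # [('u1', 'click'), ('u2', 'view'), ('u3', 'view')]
```

In Spark the same shape of operation shuffles all records by their hash, which is why the dedup of a multi-TB dataset cannot avoid a large shuffle.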
> PairRDD is basically constructed using the Kafka streaming low-level consumer,
> > which has all records with the same key already in the same partition. Can I
> > group them together while avoiding a shuffle?
> >
> > Thanks
> >
>
>
>
--
Best Regards,
Ayan Guha
nfiguration().set("fs.s3.awsAccessKeyId", "")
>>> sc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey",
>>> "")
>>>
>>> 2. Set keys in URL, e.g.:
>>> sc.textFile("s3n://@/bucket/test/testdata")
>>>
>>>
>>> Both if which I'm reluctant to do within production code!
>>>
>>>
>>> Cheers
>>>
>>
>>
--
Best Regards,
Ayan Guha
Apache Spark User List mailing list archive at Nabble.com.
>
>
>
>
>
>
--
Best Regards,
Ayan Guha
gt;
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 31 May 2016 at 06:31, ayan guha <guha.a...@gmail.com> wrote:
>
>> No there is no semicolon.
>>
>> This is the query:
>>
>> 16/05/31 14:34:29 INFO SparkExecuteStatementOperation: Running
> could be stored in Hive, yet your only access method is via a JDBC or
> Thrift/REST service. Think also of compute / storage cluster
> implementations.
>
> WRT #2, that's not exactly what I meant by exposing the data… and there are
> limitations to the Thrift service…
>
> On Jun 2
ultimately resides. There really is
> a method to my madness, and if I could explain it… these questions really
> would make sense. ;-)
>
> TIA,
>
> -Mike
>
>
>
>
--
Best Regards,
Ayan Guha
;>>> will cut the effort of learning scala.
>>>>>
>>>>> https://spark.apache.org/docs/0.9.0/python-programming-guide.html
>>>>>
>>>>> - Thanks, via mobile, excuse brevity.
>>>>> On Jun 18, 2016 2:34 PM, "Aakash Basu" <raj2coo...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I've a python code, which I want to convert to Scala for using it in
>>>>>> a Spark program. I'm not so well acquainted with python and learning
>>>>>> scala
>>>>>> now. Any Python+Scala expert here? Can someone help me out in this
>>>>>> please?
>>>>>>
>>>>>> Thanks & Regards,
>>>>>> Aakash.
>>>>>>
>>>>>
>>>
--
Best Regards,
Ayan Guha
col4, then why does it shuffle everything, whereas
> it only needs to sort each partition and then do the grouping there itself?
>
> A bit confusing; I am using 1.5.1.
>
> Is it fixed in future versions?
>
> Thanks
>
--
Best Regards,
Ayan Guha
when you have multiple columns, do I have
> to loop on each column?
> >
> > thanks
>
>
>
--
Best Regards,
Ayan Guha
created on the client where the job was submitted? i.e., driver and
> AM on the same client?
> Or
> B) yarn decides where the the AM should be created?
>
> 2) Driver and AM run in different processes : is my assumption correct?
>
> Regards,
> Praveen
>
--
Best Regards,
Ayan Guha
g a process on a remote host
>>>> to execute a shell script seems like a lot of effort. What are the
>>>> recommended ways to connect to and query Spark from a remote client? Thanks
>>>> Thx !
>>>> --
>>>> View this message in context: Spark Integration Patterns
>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Integration-Patterns-tp26354.html>
>>>> Sent from the Apache Spark User List mailing list archive
>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>>>>
>>>
>>>
>
>
> --
> Luciano Resende
> http://people.apache.org/~lresende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>
--
Best Regards,
Ayan Guha
of now. See this ticket
> <https://issues.apache.org/jira/browse/SPARK-4226> for more on this.
>
>
>
>
> Tariq, Mohammad
> about.me/mti
> <http://about.me/mti>
>
>
> On Fri, Feb 26, 2016 at 7:01 AM, ayan guha &l
n (select max(column1) from mytable)
>
> Thanks
>
--
Best Regards,
Ayan Guha
Why can't you use the JDBC source in the Hive context? I don't think
sharing data across contexts is allowed.
On 15 Feb 2016 07:22, "Mich Talebzadeh" wrote:
> I am intending to get a table from Hive and register it as temporary table
> in Spark.
>
>
>
> I have created contexts for
Have you tried repartitioning to a larger number of partitions? Also, I
would suggest increasing the number of executors and giving each a smaller
amount of memory.
On 15 Feb 2016 06:49, "gustavolacerdas" wrote:
> I have a machine with 96GB and 24 cores. I'm trying to run a
Thanks. Do we have any forum or study group for certification aspirants? I
would like to join.
On 15 Feb 2016 05:53, "Olivier Girardot"
wrote:
> It does not contain (as of yet) anything > 1.3 (for example in depth
> knowledge of the Dataframe API)
> but you need
Value("rowkey")));
> list.add(get);
>
> }
>
> *Result[] res = table.get(list); // Is this really needed? Because it takes
> extra time to scan, right?*
> This piece of code i got from
> http://www.programering.com/a/MTM5kDMwATI.html
>
> please correct if anything wrong :)
>
> Thanks
> Beesh
>
>
--
Best Regards,
Ayan Guha
or management of
> Spark resources) ?
>
> Thank you
>
>
--
Best Regards,
Ayan Guha
to read from Kafka there
> are 5 tasks writing to E/S. So I'm supposing that the task reading from
> Kafka does it in parallel using 5 partitions, and that's why there are then
> 5 tasks to write to E/S. But I'm supposing ...
>
> On Feb 16, 2016, at 21:12, ayan guha <guha.a...@gmail.com>
>>> t1, file1, 1, 1, 1
>>>>> t1, file1, 1, 2, 3
>>>>> t1, file2, 2, 2, 2, 2
>>>>> t2, file1, 5, 5, 5
>>>>> t2, file2, 1, 1, 2, 2
>>>>>
>>>>> and i want to achieve the output like below rows which is a vertical
>>>>> addition of the corresponding numbers.
>>>>>
>>>>> *Output*
>>>>> “file1” : [ 1+1+5, 1+2+5, 1+3+5 ]
>>>>> “file2” : [ 2+1, 2+1, 2+2, 2+2 ]
>>>>>
>>>>> I am in a spark streaming context and i am having a hard time trying
>>>>> to figure out the way to group by file name.
>>>>>
>>>>> It seems like i will need to use something like below, i am not sure
>>>>> how to get to the correct syntax. Any inputs will be helpful.
>>>>>
>>>>> myDStream.foreachRDD(rdd => rdd.groupBy())
>>>>>
>>>>> I know how to do the vertical sum of array of given numbers, but i am
>>>>> not sure how to feed that function to the group by.
>>>>>
>>>>> def compute_counters(counts : ArrayBuffer[List[Int]]) = {
>>>>> counts.toList.transpose.map(_.sum)
>>>>> }
>>>>>
>>>>> ~Thanks,
>>>>> Vinti
>>>>>
>>>>
>>>>
>>>
>>
>
--
Best Regards,
Ayan Guha
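The "vertical addition" asked about in the thread above, sketched in plain Python: group rows by file name, then transpose-and-sum the numeric lists, exactly what the Scala snippet's counts.toList.transpose.map(_.sum) does (the sample rows mirror the thread's data; the function name is my own):

```python
from collections import defaultdict

def vertical_sum(rows):
    """rows: iterable of (file_name, [numbers]) -> {file_name: [column sums]}."""
    groups = defaultdict(list)
    for name, nums in rows:
        groups[name].append(nums)        # collect lists per file name
    # zip(*lists) transposes; sum each column
    return {name: [sum(col) for col in zip(*lists)]
            for name, lists in groups.items()}

rows = [("file1", [1, 1, 1]), ("file1", [1, 2, 3]),
        ("file2", [2, 2, 2, 2]), ("file1", [5, 5, 5]),
        ("file2", [1, 1, 2, 2])]
print(vertical_sum(rows))  # {'file1': [7, 8, 9], 'file2': [3, 3, 4, 4]}
```

In a streaming job the same function would be applied to each group produced by grouping the RDD on file name inside foreachRDD.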
is not a good choice yet for the use case, what are the
other alternatives?
I would appreciate any help/pointers/guidance.
--
Best Regards,
Ayan Guha
ou use a select query, the output is
>> automatically displayed as a chart.
>>
>> As RDDs are bound to the context that creates them, I don't think
>> Zeppelin can use those RDDs.
>>
>> I don't know if notebooks can be reused within other notebooks. It would
>&g
gt; On Tue, Mar 8, 2016 at 8:50 AM, ayan guha <guha.a...@gmail.com> wrote:
>
>> Why not compare the current time in every batch and, if it meets a certain
>> condition, emit the data?
>> On 9 Mar 2016 00:19, "Abhishek Anand" <abhis.anan...@gmail.com> wrote:
>&
> It's same as hive thrift server. I believe kerberos is supported.
>
> On Wed, Mar 16, 2016 at 10:48 AM, ayan guha <guha.a...@gmail.com> wrote:
>
>> so, how about implementing security? Any pointer will be helpful
>>
>> On Wed, Mar 16, 2016 at 1:
What you
> guys think?
>
> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Why do you want to reimplement something which is already there?
>>
>> On 06 Apr 2016, at 06:47, ayan guha <guha.a...@gmail.com> wrote:
>>
>&g
Hi All
Asking opinion: is it possible/advisable to use spark to replace what sqoop
does? Any existing project done in similar lines?
--
Best Regards,
Ayan Guha
>
> Hi All
>
> I found this issue listed in Spark Jira -
> https://issues.apache.org/jira/browse/SPARK-8659
>
> I would love to know if there are any roadmap for this? Maybe someone from
> dev group can confirm?
>
> Thank you in advance
>
> Best
> Ayan
>
>
, 2016 at 10:32 PM, ayan guha <guha.a...@gmail.com> wrote:
> Thanks guys for reply. Yes, Zeppelin with Spark is pretty compelling
> choice, for single user. Any pointers for using Zeppelin for multi user
> scenario? In essence, can we either (a) Use Zeppelin to connect to a long
Why not compare the current time in every batch and, if it meets a certain
condition, emit the data?
On 9 Mar 2016 00:19, "Abhishek Anand" wrote:
> I have a spark streaming job where I am aggregating the data by doing
> reduceByKeyAndWindow with inverse function.
>
> I am keeping
java:111)
>>>>>> at java.lang.Thread.run(Thread.java:744)
>>>>>> 16/02/24 11:11:47 INFO shuffle.RetryingBlockFetcher: Retrying fetch
>>>>>> (1/3) for 6 outstanding blocks after 5000 ms
>>>>>> 16/02/24 11:11:52 INFO client.TransportClientFactory: Found inactive
>>>>>> connection to maprnode5, creating a new one.
>>>>>> 16/02/24 11:12:16 WARN server.TransportChannelHandler: Exception in
>>>>>> connection from maprnode5
>>>>>> java.io.IOException: Connection reset by peer
>>>>>> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>>>>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>>>>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>>>>> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>>>>>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>>>>> at
>>>>>> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
>>>>>> at
>>>>>> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
>>>>>> at
>>>>>> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
>>>>>> at
>>>>>> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>>>>>> at
>>>>>> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>>>>>> at
>>>>>> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>>>>>> at
>>>>>> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>>>>>> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>>>>>> at
>>>>>> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>>>>>> at java.lang.Thread.run(Thread.java:744)
>>>>>> 16/02/24 11:12:16 ERROR client.TransportResponseHandler: Still have 1
>>>>>> requests outstanding when connection from maprnode5 is closed
>>>>>> 16/02/24 11:12:16 ERROR shuffle.OneForOneBlockFetcher: Failed while
>>>>>> starting block fetches
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
>
>
--
Best Regards,
Ayan Guha
`user.timestamp` as
> rawTimeStamp, `user.requestId` as requestId,
> *floor(`user.timestamp`/72000*) as timeBucket FROM logs");
> bucketLogs.toJSON().saveAsTextFile("target_file");
>
> Regards
> Ashok
>
--
Best Regards,
Ayan Guha
ut some other knowledgeable people on the list, please chime
>>> in). Two, since Spark is written in Scala, it gives you an enormous
>>> advantage to read sources (which are well documented and highly readable)
>>> should you have to consult or learn nuances of certain API method or action
>>> not covered comprehensively in the docs. And finally, there’s a long term
>>> benefit in learning Scala for reasons other than Spark. For example,
>>> writing other scalable and distributed applications.
>>>
>>>
>>> Particularly, we will be using Spark Streaming. I know a couple of years
>>> ago that practically forced the decision to use Scala. Is this still the
>>> case?
>>>
>>>
>>> You’ll notice that certain APIs call are not available, at least for
>>> now, in Python.
>>> http://spark.apache.org/docs/latest/streaming-programming-guide.html
>>>
>>>
>>> Cheers
>>> Jules
>>>
>>> --
>>> The Best Ideas Are Simple
>>> Jules S. Damji
>>> e-mail:dmat...@comcast.net
>>> e-mail:jules.da...@gmail.com
>>>
>>>
>
--
Best Regards,
Ayan Guha
You can add the index from mapPartitionsWithIndex to the output and order
based on it in the merge step.
On 19 May 2016 13:22, "Pulasthi Supun Wickramasinghe"
wrote:
> Hi Devs/All,
>
> I am pretty new to Spark. I have a program which does some map reduce
> operations with
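The mapPartitionsWithIndex suggestion above, sketched in plain Python (a toy model: in Spark each partition's output would be tagged with its index via rdd.mapPartitionsWithIndex, and the driver sorts on that index when merging; the sample data is made up):

```python
# Partition results as they might arrive from workers, out of order,
# each tagged with the partition index that produced them
collected = [(1, ["c", "d"]), (2, ["e"]), (0, ["a", "b"])]

# Merge step: order partitions by their index, then flatten
merged = [x for _, part in sorted(collected) for x in part]
print(merged)  # ['a', 'b', 'c', 'd', 'e']
```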
f2.txt
> > tar2:
> > - f1.txt
> > - f2.txt
> >
> > (each tar file will have exact same number of files, same name)
> >
> > I am trying to find a way (spark or pig) to extract them to their own
> folders.
> >
> > f1
> > - tar1_f1.txt
> > - tar2_f1.txt
> > f2:
> >- tar1_f2.txt
> >- tar1_f2.txt
> >
> > Any help?
> >
> >
> >
> > --
> > Best Regards,
> > Ayan Guha
>
>
>
01.4 sec HDFS Read:
> 5318569 HDFS Write: 46 SUCCESS
>
> Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>
> OK
>
> INFO : 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%,
> Cumulative CPU 101.4 sec
>
> INFO : MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
>
> INFO : Ended Job = job_1463956731753_0005
>
> INFO : MapReduce Jobs Launched:
>
> INFO : Stage-Stage-1: Map: 22 Reduce: 1 Cumulative CPU: 101.4 sec
> HDFS Read: 5318569 HDFS Write: 46 SUCCESS
>
> INFO : Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>
> INFO : Completed executing
> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc);
> Time taken: 142.525 seconds
>
> INFO : OK
>
> +----+----+-----------+-----------------------+
> | c0 | c1 | c2        | c3                    |
> +----+----+-----------+-----------------------+
> | 1  | 1  | 5.0005E7  | 2.8867513459481288E7  |
> +----+----+-----------+-----------------------+
>
> 1 row selected (142.744 seconds)
>
>
>
> OK Hive on map-reduce engine took 142 seconds compared to 58 seconds with
> Hive on Spark. So you can obviously gain pretty well by using Hive on Spark.
>
>
>
> Please also note that I did not use any vendor's build for this purpose. I
> compiled Spark 1.3.1 myself.
>
>
>
> HTH
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com/
>
>
>
>
--
Best Regards,
Ayan Guha
folders.
f1
- tar1_f1.txt
- tar2_f1.txt
f2:
- tar1_f2.txt
- tar1_f2.txt
Any help?
--
Best Regards,
Ayan Guha
Hi
Can we write to a non-default Hive table using PySpark?
How many executors are you running? Does your partitioning scheme ensure
data is distributed evenly? It is possible that your data is skewed and one
of the executors is failing. Maybe you can try reducing per-executor memory
and increasing the number of partitions.
On 2 May 2016 14:19, "Buntu Dev"
How about outer join?
On 9 May 2016 13:18, "Raghava Mutharaju" wrote:
> Hello All,
>
> We have two PairRDDs (rdd1, rdd2) which are hash partitioned on key
> (number of partitions are same for both the RDDs). We would like to
> subtract rdd2 from rdd1.
>
> The subtract
>>> I'm glad you've mentioned it.
>>>
>>> I think Cloudera (and Hortonworks?) guys are doing a great job with
>>> bringing all the features of YARN to Spark and I think Spark on YARN
>>> shines features-wise.
>>>
>>> I'm not in a position to compare YARN vs Mesos for their resource
>>> management, but Spark on Mesos is certainly lagging behind Spark on
>>> YARN regarding the features Spark uses off the scheduler backends --
>>> security, data locality, queues, etc. (or I might be simply biased
>>> after having spent months with Spark on YARN mostly?).
>>>
>>> Jacek
>>>
>>
>>
>
--
Best Regards,
Ayan Guha
, 2016 at 1:38 AM, Michael Segel <msegel_had...@hotmail.com>
wrote:
> Hey, silly question?
>
> If you’re running a load balancer, are you trying to reuse the RDDs
> between jobs?
>
> TIA
> -Mike
>
> On Jul 13, 2016, at 9:08 AM, ayan guha <guha.a...@gmail.com
it
>> is taking 2 hours for inserting / upserting 500K records in Parquet format
>> in some hdfs location where each location gets mapped to one partition.
>>
>> My spark conf specs are :
>>
>> yarn cluster mode. single node.
>> spark.executor.memory 8g
>> spark.rpc.netty.dispatcher.numThreads 2
>>
>> Thanks,
>> Sumit
>>
>>
>>
>
--
Best Regards,
Ayan Guha
me see how HBase might efficiently
> tackle this classic upsert case.
>
> Thanks,
> Sumit
>
> On Fri, Jul 29, 2016 at 3:22 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> This is a classic case compared to hadoop vs DWH implmentation.
>>
>> Source (Delt
not possible anymore?
[image: Inline image 1]
--
Best Regards,
Ayan Guha
No, I confirmed the master is running via the Spark UI at localhost:8080.
On 1 Aug 2016 18:22, "Nikolay Zhebet" <phpap...@gmail.com> wrote:
> I think you haven't run spark master yet, or maybe port 7077 is not yours
> default port for spark master.
>
> 2016-08-01 4:24
In order to use an existing pg UDF, you may create a view in pg and expose
the view to Hive.
The Spark-to-database connection happens from each executor, so you must
have a connection, or a pool of connections, per worker. Executors on the
same worker can share a connection pool.
Best
Ayan
On 25 Jul 2016
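The per-executor connection advice above can be sketched like this: open one connection per partition (what foreachPartition gives you in Spark), not one per record. This is a toy model; `connect` is a stand-in counter, not a real database driver.

```python
opened = []

def connect():
    opened.append("conn")          # stub: count how many connections we open
    return {"rows": []}

def write_partition(rows):
    conn = connect()               # one connection for the whole partition
    for row in rows:
        conn["rows"].append(row)   # reuse it for every record

partitions = [[1, 2, 3], [4, 5]]   # Spark would run this per executor task
for part in partitions:
    write_partition(part)
print(len(opened))  # 2 connections for 2 partitions, not 5 for 5 records
```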
You can use rank with a window function; rank = 1 is the same as calling
first(). I'm not sure how you would randomly pick records, though, if there
is no Nth record. In your example, what happens if the data has only 2 rows?
On 27 Jul 2016 00:57, "Alex Nastetsky"
wrote:
>
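The "rank = 1 per group" idea above, sketched in plain Python (in Spark this would be a Window partitioned by the key, ordered by the value, filtered on rank() == 1; the sample rows are made up):

```python
rows = [("a", 3), ("b", 1), ("a", 1), ("b", 2)]

# Equivalent of Window.partitionBy(key).orderBy(value) with rank() == 1:
# sort globally, then keep only the first row seen for each key
first_per_group = {}
for key, value in sorted(rows):
    first_per_group.setdefault(key, value)
print(first_per_group)  # {'a': 1, 'b': 1}
```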
ava recipes for Apache Spark updated.
>>> It's done here: http://jgp.net/2016/07/22/spark-java-recipes/ and in
>>> the GitHub repo.
>>>
>>> Enjoy / have a great week-end.
>>>
>>> jg
>>>
>>>
>>>
>>
>>
>>
--
Best Regards,
Ayan Guha
images = sc.binaryFiles("/myimages/*.jpg")
> image_to_text = lambda rawdata: do_some_with_bytes(file_bytes(rawdata))
> print images.values().map(image_to_text).take(1) #this gives an error
>
>
> What is the way to load this library?
>
>
--
Best Regards,
Ayan Guha
| 2015-12-15| XYZ LTD CD 4636 | 10.95|
> +---+--+---+
>
> Now if I want to use the var maxdate in place of "2015-12-15", how would I
> do that?
>
> I tried lit(maxdate) etc but they are all giving me error?
>
> java.lang.RuntimeException: Unsupported literal type class
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
> [2015-12-15]
>
>
> Thanks
>
--
Best Regards,
Ayan Guha
vant to these questions.
>
> Thanks again.
>
>
>
> On Sat, Jul 30, 2016 at 1:42 AM, Bhaarat Sharma <bhaara...@gmail.com>
> wrote:
>
>> Great, let me give that a shot.
>>
>> On Sat, Jul 30, 2016 at 1:40 AM, ayan guha <guha.a...@gmail.com> wrote:
it possible to achieve what I'm after? I don't want to write files to the
> local file system and then put them in HDFS. Instead, I want to use the
> saveAsTextFile method on the RDD directly.
>
>
>
--
Best Regards,
Ayan Guha
ss.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable fo
It would help to share the Spark version, environment details, and a code
snippet. There are many very knowledgeable people here who can then help.
> Can someone help me understand this error which occurs while running a
> filter on a
)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThread
EventExecutor.java:111)
... 1 more
Am I doing something wrong?
--
Best Regards,
Ayan Guha
>>>> your responsibility) and if their optimizations are correctly configured
>>>> (min-max index, bloom filter, compression, etc.).
>>>>
>>>> If you need to ingest sensor data you may want to store it first in
>>>> hbase and then batch process it in large files in Orc or parquet format.
>>>>
>>>> On 26 Jul 2016, at 04:09, janardhan shetty <janardhan...@gmail.com>
>>>> wrote:
>>>>
>>>> Just wondering advantages and disadvantages to convert data into ORC or
>>>> Parquet.
>>>>
>>>> In the documentation of Spark there are numerous examples of Parquet
>>>> format.
>>>>
>>>> Any strong reasons to chose Parquet over ORC file format ?
>>>>
>>>> Also : current data compression is bzip2
>>>>
>>>>
>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>>> This seems like biased.
>>>>
>>>>
>>>
>>
>
--
Best Regards,
Ayan Guha
Zeppelin already has a param for jdbc
On 2 Aug 2016 19:50, "Mich Talebzadeh" wrote:
> Ok I have already set up mine
>
>
> hive.limit.optimize.fetch.max
> 5
>
> Maximum number of rows allowed for a smaller subset of data for
> simple LIMIT,
I would stay away from transactional tables until they are fully baked. I
do not see why you need to update rather than keep inserting with a
timestamp and derive the latest value on the fly while joining.
But I guess it has become a religious question now :) and I am not
unbiased.
On 3 Aug 2016 08:51, "Mich