is to show overlapping partitions, duplicates, index-to-partition
mismatch - that sort of thing.
On Thu, Aug 13, 2015 at 11:42 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Yep, and it works fine for operations which do not involve any shuffle
(like foreach, count etc.) and those which
Looks like a jar version conflict to me.
Thanks
Best Regards
On Thu, Aug 13, 2015 at 7:59 PM, satish chandra j jsatishchan...@gmail.com
wrote:
HI,
Please let me know if I am missing anything in the mail below, so that the
issue can be fixed
Regards,
Satish Chandra
On Wed, Aug 12, 2015 at 6:59
You can put a try..catch around all the transformations that you are doing
and catch such exceptions instead of crashing your entire job.
Thanks
Best Regards
On Fri, Aug 14, 2015 at 4:35 PM, hide x22t33...@gmail.com wrote:
Hello,
I'm using Spark on a YARN cluster and using
12, 2015 at 10:57 PM, Imran Rashid iras...@cloudera.com wrote:
yikes.
Was this a one-time thing? Or does it happen consistently? Can you turn
on debug logging for o.a.s.scheduler (dunno if it will help, but maybe ...)
On Tue, Aug 11, 2015 at 8:59 AM, Akhil Das ak...@sigmoidanalytics.com
Can you look in the worker logs and see what's going wrong?
Thanks
Best Regards
On Wed, Aug 12, 2015 at 9:53 PM, Nupur Kumar (BLOOMBERG/ 731 LEX)
nkumar...@bloomberg.net wrote:
Hello,
I am doing this for the first time so feel free to let me know/forward
this to where it needs to be if not
You need to add that jar to the classpath. While submitting the job, you
can use the --jars, --driver-class-path etc. options to add the jar. Apart
from that, if you are running the job as a standalone application, then you
can use the sc.addJar option to add the jar (which will ship this jar into
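A minimal sketch of both approaches (the jar path and class name are placeholders):

// Option 1: ship the jar at submit time
//   spark-submit --class com.example.Main --jars /path/to/extra-lib.jar app.jar
// Option 2: from a standalone application, add it programmatically
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("add-jar-example"))
sc.addJar("/path/to/extra-lib.jar")  // ships the jar to the executors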
Have a look at spark.shuffle.manager; you can switch between sort and hash
with this configuration. From the configuration docs:
spark.shuffle.manager (default: sort) - Implementation to use for shuffling data. There
are two implementations available: sort and hash. Sort-based shuffle is more
memory-efficient and is the default option.
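For example, switching to the hash implementation could look like this (a minimal sketch; the same property can also be passed with --conf at submit time):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-manager-example")
  .set("spark.shuffle.manager", "hash")  // "sort" is the default
val sc = new SparkContext(conf)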
You can do something like this:
val fStream = ssc.textFileStream("/sigmoid/data/")
  .map(x => {
    try {
      //Move all the transformations within a try..catch
    } catch {
      case e: Exception => { logError("Whoops!!"); null }
    }
  })
Thanks
Best Regards
On Mon, Aug 10, 2015 at 7:44 PM, Mario Pastorelli
Load data to where? To Spark? If you are referring to Spark, then there are
some differences in the way the connector is implemented. When you use Spark,
the most important thing that you get is the parallelism (depending on the
number of partitions). If you compare it with a native Java driver then
Like this: (Including the filter function)
JavaPairInputDStream<LongWritable, Text> inputStream = ssc.fileStream(
    testDir.toString(),
    LongWritable.class,
    Text.class,
    TextInputFormat.class,
    new Function<Path, Boolean>() {
      @Override
      public Boolean call(Path
Hi Starch,
It also depends on the application's behavior; some might not be able to
properly utilize the network. If you are using, say, Kafka, then one thing
that you should keep in mind is the size of the individual messages and the
number of partitions that you are having. The higher the message
You can create a new issue and send a pull request for the same, I think.
+ dev list
Thanks
Best Regards
On Tue, Aug 11, 2015 at 8:32 AM, Hyukjin Kwon gurwls...@gmail.com wrote:
Dear Sir / Madam,
I have a plan to contribute some codes about passing filters to a
datasource as physical
There's a daily quota and a minutely quota, you could be hitting those. You
can ask Google to increase the quota for the particular service. Now, to
stay within the limit from the Spark side, you can actually do a re-partition to
a smaller number before doing the save. Another way is to use the local file
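A sketch of the re-partition idea mentioned above (the paths and the partition count are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("quota-friendly-save"))
val rdd = sc.textFile("/input/path")  // placeholder input

// Fewer partitions means fewer concurrent write requests against the service
rdd.repartition(8).saveAsTextFile("/output/path")  // placeholder output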
Hi
My Spark job (running in local[*] with Spark 1.4.1) reads data from a
thrift server (I created an RDD; it computes the partitions in the
getPartitions() call, and in compute(), hasNext returns records from these
partitions). count() and foreach() are working fine and return the correct
number of
Isn't it space-separated data? It is not comma (,) separated nor pipe (|)
separated data.
Thanks
Best Regards
On Mon, Aug 10, 2015 at 12:06 PM, Netwaver wanglong_...@163.com wrote:
Hi Spark experts,
I am now using Spark 1.4.1 and trying Spark SQL/DataFrame
API with text
Are you setting SPARK_PREPEND_CLASSES? Try disabling it. Here your uber
jar, which does not have the SparkConf, is put in the first place of the
classpath, which is messing it up.
Thanks
Best Regards
On Thu, Aug 6, 2015 at 5:48 PM, Stephen Boesch java...@gmail.com wrote:
Given the following
Did you try this way?
export HIVE_SERVER2_THRIFT_PORT=6066
./sbin/start-thriftserver.sh --master master-uri
export HIVE_SERVER2_THRIFT_PORT=6067
./sbin/start-thriftserver.sh --master master-uri
You just have to change HIVE_SERVER2_THRIFT_PORT to instantiate multiple
servers, I think.
Thanks
Just make sure your hadoop instances are functioning properly (check for
ResourceManager, NodeManager). How are you submitting the job? If it is
getting submitted, then you can look further in the yarn logs to see what's
really going on.
Thanks
Best Regards
On Thu, Aug 6, 2015 at 6:59 PM, Clint
Which version of Spark are you using? Looks like you are hitting the file
handle limit. In that case you might want to increase the ulimit. You can
actually validate this by looking in the worker logs (which would probably
show a "Too many open files" exception).
Thanks
Best Regards
On Thu, Aug 6, 2015 at
Can you look more in the worker logs and see what's going on? It looks like
a memory issue (kind of GC overhead etc.; you need to look in the worker
logs).
Thanks
Best Regards
On Fri, Aug 7, 2015 at 3:21 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
Re attaching the images.
On Thu, Aug 6, 2015
Did you try this way?
/usr/local/spark/bin/spark-submit --master mesos://mesos.master:5050 --conf
spark.mesos.executor.docker.image=docker.repo/spark:latest --class
org.apache.spark.examples.SparkPi --jars
hdfs://hdfs1/tmp/spark-examples-1.4.1-hadoop2.6.0-cdh5.4.4.jar 100
Thanks
Best
I'm not sure what you are up to, but if you can explain what you are trying
to achieve then maybe we can restructure your code. On a quick glance I
could see:
tweetsRDD.collect().map(tweet =>
DBQuery.saveTweets(tweet))
which will bring the whole data into your driver machine and it would
Depends on which operation you are doing. If you are doing a .count() on a
parquet file, it might not download the entire file, I think, but if you do a
.count() on a normal text file it might pull the entire file.
Thanks
Best Regards
On Sat, Aug 8, 2015 at 3:12 AM, Akshat Aranya aara...@gmail.com
Why not give it a shot? Spark generally outruns old MapReduce jobs.
Thanks
Best Regards
On Sat, Aug 8, 2015 at 8:25 AM, linlma lin...@gmail.com wrote:
I have tens of millions of records, which are customer ID and city ID pairs.
There are tens of millions of unique customer IDs, and only a few
It seems it is not able to pick up the debug parameters. You can actually
set export
_JAVA_OPTIONS=-agentlib:jdwp=transport=dt_socket,address=8000,server=y,suspend=y
and then submit the job to enable debugging.
Thanks
Best Regards
On Fri, Aug 7, 2015 at 10:20 PM, Benjamin Ross
I think you can. Give it a try and see.
Thanks
Best Regards
On Tue, Aug 4, 2015 at 7:02 AM, swetha swethakasire...@gmail.com wrote:
Hi,
Can I use multiple UpdateStateByKey Functions in the Streaming job? Suppose
I need to maintain the state of the user session in the form of a Json and
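A rough sketch of chaining two independent updateStateByKey calls (the types and update logic here are made up for illustration, not the poster's actual code):

import org.apache.spark.streaming.dstream.DStream

// State 1: a per-user session blob (a JSON string here, as a stand-in)
def updateSession(events: Seq[String], state: Option[String]): Option[String] =
  Some(state.getOrElse("{}") + events.mkString)

// State 2: an independent per-user event counter
def updateCount(events: Seq[String], state: Option[Long]): Option[Long] =
  Some(state.getOrElse(0L) + events.size)

def track(events: DStream[(String, String)]): Unit = {
  val sessions = events.updateStateByKey(updateSession _)  // first piece of state
  val counts   = events.updateStateByKey(updateCount _)    // second piece of state
  sessions.print()
  counts.print()
}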
For some reason you have two different versions of Spark jars in your
classpath.
Thanks
Best Regards
On Tue, Aug 4, 2015 at 12:37 PM, Deepesh Maheshwari
deepesh.maheshwar...@gmail.com wrote:
Hi,
I am trying to read data from Kafka and process it using Spark.
I have attached my source
...@platalytics.com
wrote:
thanks a lot
On Tue, Aug 4, 2015 at 2:00 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
You will have to write your own consumer for pulling your custom
feeds, and then you can do a union
(customfeedDstream.union(twitterStream))
with the twitter stream api.
Thanks
Best
Welcome aboard!
Thanks
Best Regards
On Thu, Aug 6, 2015 at 11:21 AM, Franc Carter franc.car...@rozettatech.com
wrote:
subscribe
You just pasted your Twitter credentials, consider changing them. :/
Thanks
Best Regards
On Wed, Aug 5, 2015 at 10:07 PM, narendra narencs...@gmail.com wrote:
Thanks Akash for the answer. I added endpoint to the listener and now it is
working.
If you are using Kafka, then you can basically push an entire file as a
message to Kafka. In that case in your DStream, you will receive the single
message which is the contents of the file and it can of course span
multiple lines.
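A minimal sketch of that idea, assuming the DStream yields (key, fileContents) pairs where each message value is a whole file:

import org.apache.spark.streaming.dstream.DStream

// Split every whole-file message back into its individual lines
def linesFromFiles(stream: DStream[(String, String)]): DStream[String] =
  stream.flatMap { case (_, fileContents) => fileContents.split("\n") }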
Thanks
Best Regards
On Mon, Aug 3, 2015 at 8:27 PM, Spark
I think you can start from here
https://issues.apache.org/jira/browse/SPARK/fixforversion/12332078/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel
Thanks
Best Regards
On Tue, Aug 4, 2015 at 12:02 PM, Meihua Wu rotationsymmetr...@gmail.com
wrote:
I think the team is
One approach would be to use a Jobserver in between and create SparkContexts
in it. Let's say you create two, one configured to run in
coarse-grained mode and another set to fine-grained. Let the high priority jobs
hit the coarse-grained SparkContext and the other jobs use the fine-grained
one.
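On Mesos the two modes are toggled by spark.mesos.coarse, so the two contexts could be configured roughly like this (a sketch only; the jobserver wiring is omitted):

import org.apache.spark.SparkConf

// Context for high-priority jobs: coarse-grained mode
val coarseConf = new SparkConf()
  .setAppName("high-priority")
  .set("spark.mesos.coarse", "true")

// Context for everything else: fine-grained mode
val fineConf = new SparkConf()
  .setAppName("low-priority")
  .set("spark.mesos.coarse", "false")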
Just to add, rdd.take(1) won't trigger the entire computation; it will just
pull out the first record. You need to do a rdd.count() or rdd.saveAs*Files
to trigger the complete pipeline. How many partitions do you see in the
last stage?
Thanks
Best Regards
On Tue, Aug 4, 2015 at 7:10 AM, ayan guha
that I want to ask is that I have used Twitter's streaming
API, and it seems that the above solution uses the REST API. How can I use both
simultaneously?
Any response will be much appreciated :)
Regards
On Tue, Aug 4, 2015 at 1:51 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Yes you can
Are you sitting behind a firewall and accessing a remote master machine? In
that case, have a look at
http://spark.apache.org/docs/latest/configuration.html#networking; you
might want to fix a few properties like spark.driver.host, spark.driver.port
etc.
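For example, fixing those properties could look like this (the host and port values are placeholders for your environment):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("behind-firewall")
  .set("spark.driver.host", "203.0.113.10")  // an address the workers can reach
  .set("spark.driver.port", "51000")         // a port that is open on the firewall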
Thanks
Best Regards
On Mon, Aug 3,
Currently RDDs are not encrypted. I think you can go ahead and open a JIRA
to add this feature, and maybe in a future release it could be added.
Thanks
Best Regards
On Fri, Jul 31, 2015 at 1:47 PM, Matthew O'Reilly moreill...@qub.ac.uk
wrote:
Hi,
I am currently working on the latest version of
I guess it goes through that 500k files
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L193 for
the first time and then uses a filter from the next time.
Thanks
Best Regards
On Fri, Jul 31, 2015 at 4:39 AM, Tathagata Das
LOL Brandon!
@ziqiu See http://spark.apache.org/community.html
You need to send an email to user-unsubscr...@spark.apache.org
Thanks
Best Regards
On Fri, Jul 31, 2015 at 2:06 AM, Brandon White bwwintheho...@gmail.com
wrote:
https://www.youtube.com/watch?v=JncgoPKklVE
On Thu, Jul 30, 2015
specific to my account?
Thanks in anticipation :)
On Thu, Jul 30, 2015 at 6:17 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Oh, this one fetches the public tweets, not the ones specific to your
account.
Thanks
Best Regards
On Thu, Jul 30, 2015 at 6:11 PM, Sadaf Khan sa
It seems to be an issue with the ES connector
https://github.com/elastic/elasticsearch-hadoop/issues/482
Thanks
Best Regards
On Tue, Jul 28, 2015 at 6:14 AM, An Tran tra...@gmail.com wrote:
Hello all,
I am currently having an error with Spark SQL access Elasticsearch using
Elasticsearch Spark
What operation are you doing with streaming? Also, can you look in the
datanode logs and see what's going on?
Thanks
Best Regards
On Tue, Jul 28, 2015 at 8:18 AM, guoqing0...@yahoo.com.hk
guoqing0...@yahoo.com.hk wrote:
Hi,
I got an error when running Spark Streaming as below.
Like this?
val data = sc.textFile("/sigmoid/audio/data/", 24).foreachPartition(urls =>
speachRecognizer(urls))
Let 24 be the total number of cores that you have on all the workers.
Thanks
Best Regards
On Wed, Jul 29, 2015 at 6:50 AM, Peter Wolf opus...@gmail.com wrote:
Hello, I am writing a
Did you try removing this jar? build/sbt-launch-0.13.7.jar
Thanks
Best Regards
On Tue, Jul 28, 2015 at 12:08 AM, Rahul Palamuttam rahulpala...@gmail.com
wrote:
Hi All,
I hope this is the right place to post troubleshooting questions.
I've been following the install instructions and I get
You can easily push data to an intermediate storage from spark streaming
(like HBase or a SQL/NoSQL DB etc) and then power your dashboards with d3
js.
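A minimal sketch of the pattern, with a hypothetical saveToStore() standing in for the HBase/SQL/NoSQL write:

import org.apache.spark.streaming.dstream.DStream

// Placeholder for an insert into whatever store backs the dashboard
def saveToStore(record: (String, Long)): Unit = { /* write to HBase / SQL / NoSQL */ }

def publish(counts: DStream[(String, Long)]): Unit =
  counts.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // open one connection per partition in a real implementation
      records.foreach(saveToStore)
    }
  }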
Thanks
Best Regards
On Tue, Jul 28, 2015 at 12:18 PM, UMESH CHAUDHARY umesh9...@gmail.com
wrote:
I have just started using Spark Streaming and
sc.parallelize takes a second parameter which is the total number of
partitions; are you using that?
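For reference, a small example of that second parameter (512 is arbitrary here; sc is the usual SparkContext):

val items = (1 to 512000).toList
val rdd = sc.parallelize(items, 512)  // explicitly request 512 partitions
println(rdd.partitions.length)        // 512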
Thanks
Best Regards
On Wed, Jul 29, 2015 at 9:27 PM, Kostas Kougios
kostas.koug...@googlemail.com wrote:
Hi, I do an sc.parallelize with a list of 512k items. But sometimes not all
executors
)
at java.lang.Thread.run(Thread.java:745)
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Tuesday, July 28, 2015 2:30 PM
To: Manohar Reddy
Cc: user@spark.apache.org
Subject: Re: java.lang.ArrayIndexOutOfBoundsException: 0 on Yarn Client
You need to trigger an action on your
Put a try catch inside your code and inside the catch print out the length
or the list itself which causes the ArrayIndexOutOfBounds. It might happen
that some of your data is not proper.
Thanks
Best Regards
On Mon, Jul 27, 2015 at 8:24 PM, Manohar753 manohar.re...@happiestminds.com
wrote:
Hi
You need to find the bottleneck here; it could be your network (if the data is
huge) or your producer code isn't pushing at 20k/s. If you are able to
produce at 20k/s then make sure you are able to receive at that rate (try
it without Spark).
Thanks
Best Regards
On Sat, Jul 25, 2015 at 3:29 PM,
With s3n try this out:
s3service.s3-endpoint - The host name of the S3 service. You should only
ever change this value from the default if you need to contact an
alternative S3 endpoint for testing purposes.
Default: s3.amazonaws.com
Thanks
Best Regards
On Tue, Jul 28, 2015 at 1:54 PM, Schmirr
Did you try it with just: (comment out line 27)
println "Count of spark: " + file.filter({s -> s.contains('spark')}).count()
Thanks
Best Regards
On Sun, Jul 26, 2015 at 12:43 AM, tog guillaume.all...@gmail.com wrote:
Hi
I have been using Spark for quite some time using either scala or python.
One approach would be to store the batch data in an intermediate storage
(like HBase/MySQL or even in zookeeper), and inside your filter function
you just go and read the previous value from this storage and do whatever
operation that you are supposed to do.
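A sketch of that idea, with a hypothetical previousValue() lookup standing in for the HBase/MySQL/ZooKeeper read:

import org.apache.spark.rdd.RDD

// Hypothetical lookup of the last stored value for a key; a stub here
def previousValue(key: String): Option[String] = None

def keepChanged(rdd: RDD[String]): RDD[String] =
  rdd.filter { record =>
    val prev = previousValue(record)  // read what was stored last time
    prev.forall(_ != record)          // keep only records that changed
  }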
Thanks
Best Regards
On Sun, Jul 26,
Did you try binding to 0.0.0.0?
Thanks
Best Regards
On Mon, Jul 27, 2015 at 10:37 PM, Wayne Song wayne.e.s...@gmail.com wrote:
Hello,
I am trying to start a Spark master for a standalone cluster on an EC2
node.
The CLI command I'm using looks like this:
Note that I'm specifying the
);
}
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Tuesday, July 28, 2015 1:52 PM
To: Manohar Reddy
Cc: user@spark.apache.org
Subject: Re: java.lang.ArrayIndexOutOfBoundsException: 0 on Yarn Client
Put a try catch inside your code and inside the catch print
)
2015-07-27 11:17 GMT+02:00 Akhil Das ak...@sigmoidanalytics.com:
So you are able to access your AWS S3 with s3a now? What is the error
that
you are getting when you try to access the custom storage with
fs.s3a.endpoint?
Thanks
Best Regards
On Mon, Jul 27, 2015 at 2:44 PM, Schmirr Wurst
How about IntelliJ? It also has a Terminal tab.
Thanks
Best Regards
On Fri, Jul 24, 2015 at 6:06 PM, saif.a.ell...@wellsfargo.com wrote:
Hi all,
I tried Notebook Incubator Zeppelin, but I am not completely happy with it.
What do you people use for coding? Anything with auto-complete,
Have a look at the current security support
https://spark.apache.org/docs/latest/security.html, Spark does not have
any encryption support for objects in memory out of the box. But if your
concern is to protect the data being cached in memory, then you can easily
encrypt your objects in memory
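One way to do that yourself, sketched with plain javax.crypto AES (the hard-coded key is for illustration only, not a production-ready key-management approach):

import javax.crypto.Cipher
import javax.crypto.spec.SecretKeySpec

val key = new SecretKeySpec("0123456789abcdef".getBytes("UTF-8"), "AES")  // demo key only

def encrypt(bytes: Array[Byte]): Array[Byte] = {
  val c = Cipher.getInstance("AES")
  c.init(Cipher.ENCRYPT_MODE, key)
  c.doFinal(bytes)
}

def decrypt(bytes: Array[Byte]): Array[Byte] = {
  val c = Cipher.getInstance("AES")
  c.init(Cipher.DECRYPT_MODE, key)
  c.doFinal(bytes)
}

// e.g. cache encrypted payloads instead of raw objects:
// val secured = rdd.map(r => encrypt(r.getBytes("UTF-8"))).cache()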
You can follow this doc
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup
Thanks
Best Regards
On Fri, Jul 24, 2015 at 10:56 AM, Siva Reddy ksiv...@gmail.com wrote:
Hi All,
I am trying to setup the Eclipse (LUNA) with Maven so that I
It's a serialization error with nested schema, I guess. You can look at
Twitter's chill-avro serializer library. Here are two discussions on the same:
- https://issues.apache.org/jira/browse/SPARK-3447
-
What's in your build.sbt? You could be messing with the Scala version, it
seems.
Thanks
Best Regards
On Fri, Jul 24, 2015 at 2:15 AM, Dan Dong dongda...@gmail.com wrote:
Hi,
When I ran with spark-submit the following simple Spark program of:
import org.apache.spark.SparkContext._
import
This spark.shuffle.sort.bypassMergeThreshold might help; you could also try
setting the shuffle manager to hash from sort. You can see more
configuration options from here
https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior.
Thanks
Best Regards
On Fri, Jul 24, 2015 at 3:33
For each of your jobs, you can pass spark.ui.port to bind to a different
port.
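For example (4041 is just an arbitrary free port; the same value can also be passed as --conf spark.ui.port=4041 on spark-submit):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("job-with-custom-ui-port")
  .set("spark.ui.port", "4041")  // give each concurrent job its own UI port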
Thanks
Best Regards
On Fri, Jul 24, 2015 at 7:49 PM, Joji John jj...@ebates.com wrote:
Thanks Ajay.
The way we wrote our spark application is that we have a generic python
code, multiple instances of which can
?
2015-07-20 18:11 GMT+02:00 Schmirr Wurst schmirrwu...@gmail.com:
Thanks, that is what I was looking for...
Any Idea where I have to store and reference the corresponding
hadoop-aws-2.6.0.jar ?:
java.io.IOException: No FileSystem for scheme: s3n
2015-07-20 8:33 GMT+02:00 Akhil Das
alternative from Python?
And also, I want to write the raw bytes of my object into files on disk,
and not using some Serialization format to be read back into Spark.
Is it possible?
Any alternatives for that?
Thanks,
Oren
On Thu, Jul 23, 2015 at 8:04 PM Akhil Das ak...@sigmoidanalytics.com
, but
at the end I will have only hashtags without statuses. Is that correct, or
did I miss something?
Thanks,
Zoran
On Wed, Jul 22, 2015 at 12:41 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
That was a pseudo code, working version would look like this:
val stream
PM, ayan guha guha.a...@gmail.com wrote:
Hi Akhil
Thanks. Will definitely take a look. Couple of questions:
1. Is it possible to use the newHadoopAPI from dataframe.write or saveAs?
2. Is esDF usable from Python?
On Fri, Jul 24, 2015 at 2:29 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Did
I guess it would wait for some time and throw up something like this:
Initial job has not accepted any resources; check your cluster UI to ensure
that workers are registered and have sufficient memory
Thanks
Best Regards
On Thu, Jul 23, 2015 at 7:53 AM, bit1...@163.com bit1...@163.com wrote:
Currently, the only way for you would be to create a proper schema for the
data. This is not a bug, but you could open a JIRA for the feature (since this
would help others solve similar use-cases), and in a future version
it could be implemented and included.
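Building an explicit schema would look roughly like this (the field names are only an example, and rowRDD stands in for an RDD[Row] built from your data):

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age",  IntegerType, nullable = true)
))

// val df = sqlContext.createDataFrame(rowRDD, schema)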
Thanks
Best Regards
On Tue, Jul 21,
Here are a few more configurations
https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-ConfigurationPropertiesinthehive-site.xmlFile
I can't find anything on the timeouts though.
Thanks
Best Regards
On Wed, Jul 22, 2015 at 1:01 AM, Judy Nash
You can try adding that jar in SPARK_CLASSPATH (it's deprecated though) in
the spark-env.sh file.
Thanks
Best Regards
On Tue, Jul 21, 2015 at 7:34 PM, Michal Haris michal.ha...@visualdna.com
wrote:
I have a spark program that uses dataframes to query hive and I run it
both as a spark-shell for
Did you try:
val data = indexed_files.groupByKey
val modified_data = data.map { a =>
  var name = a._2.mkString(",")
  (a._1, name)
}
modified_data.foreach { a =>
  var file = sc.textFile(a._2)
  println(file.count)
}
Thanks
Best Regards
On Wed, Jul 22, 2015 at 2:18 AM, MorEru
It looks like its picking up the wrong namenode uri from the
HADOOP_CONF_DIR, make sure it is proper. Also for submitting a spark job to
a remote cluster, you might want to look at spark.driver.host and
spark.driver.port
Thanks
Best Regards
On Wed, Jul 22, 2015 at 8:56 PM, rok
Did you happen to look into esDF
https://github.com/elastic/elasticsearch-hadoop/issues/441? You can open
an issue over here if that doesn't solve your problem
https://github.com/elastic/elasticsearch-hadoop/issues
Thanks
Best Regards
On Tue, Jul 21, 2015 at 5:33 PM, ayan guha
You can look into .saveAsObjectFile
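A quick sketch of the round trip (the path is a placeholder; note that object files use Java serialization rather than your own raw-byte format):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def roundTrip(sc: SparkContext, results: RDD[String]): RDD[String] = {
  results.saveAsObjectFile("/sigmoid/processed")  // write the processed objects out
  sc.objectFile[String]("/sigmoid/processed")     // read them back later
}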
Thanks
Best Regards
On Thu, Jul 23, 2015 at 8:44 PM, Oren Shpigel o...@yowza3d.com wrote:
Hi,
I use Spark to read binary files using SparkContext.binaryFiles(), and then
do some calculations, processing, and manipulations to get new objects
(also
. It would be
great if somebody with experience on this could comment on these concerns.
Thanks,
Zoran
On Mon, Jul 20, 2015 at 12:19 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Jorn meant something like this:
val filteredStream = twitterStream.transform(rdd => {
val newRDD
where I have to store and reference the corresponding
hadoop-aws-2.6.0.jar ?:
java.io.IOException: No FileSystem for scheme: s3n
2015-07-20 8:33 GMT+02:00 Akhil Das ak...@sigmoidanalytics.com:
Not in the uri, but in the hadoop configuration you can specify it.
<property>
  <name>fs.s3a.endpoint</name>
I'd suggest upgrading to 1.4 as it has better metrics and UI.
Thanks
Best Regards
On Mon, Jul 20, 2015 at 7:01 PM, Shushant Arora shushantaror...@gmail.com
wrote:
Is coalesce not applicable to kafkaStream? How do I do a coalesce on
a kafka direct stream? It's not there in the API.
Shall calling
Do you have HADOOP_HOME, HADOOP_CONF_DIR and hadoop's winutils.exe in the
environment?
Thanks
Best Regards
On Mon, Jul 20, 2015 at 5:45 PM, nitinkalra2000 nitinkalra2...@gmail.com
wrote:
Hi All,
I am working on Spark 1.4 on windows environment. I have to set eventLog
directory so that I can
Here are two ways of doing that:
Without the filter function:
JavaPairDStream<String, String> foo =
    ssc.<String, String, SequenceFileInputFormat>fileStream("/tmp/foo");
With the filter function:
JavaPairInputDStream<LongWritable, Text> foo = ssc.fileStream("/tmp/foo",
    LongWritable.class,
("fs.s3n.endpoint", "test.com")
And I continue to get my data from Amazon, how could it be? (I also
use s3n in my text url)
2015-07-21 9:30 GMT+02:00 Akhil Das ak...@sigmoidanalytics.com:
You can add the jar in the classpath, and you can set the property like:
sc.hadoopConfiguration.set(fs.s3a.endpoint
...@gmail.com
wrote:
Hi Akhil,
I don't have HADOOP_HOME or HADOOP_CONF_DIR and even winutils.exe ? What's
the configuration required for this ? From where can I get winutils.exe ?
Thanks and Regards,
Nitin Kalra
On Tue, Jul 21, 2015 at 1:30 PM, Akhil Das ak...@sigmoidanalytics.com
wrote
It could be a GC pause or something; you need to check in the stages tab
and see what is taking time. If you upgrade to Spark 1.4, it has better UI
and DAG visualization which helps you debug better.
Thanks
Best Regards
On Mon, Jul 20, 2015 at 8:21 PM, Pa Rö paul.roewer1...@googlemail.com
wrote:
) is assumed.
</description>
</property>
Thanks
Best Regards
On Sun, Jul 19, 2015 at 9:13 PM, Schmirr Wurst schmirrwu...@gmail.com
wrote:
I want to use pithos, where can I specify that endpoint? Is it
possible in the url?
2015-07-19 17:22 GMT+02:00 Akhil Das ak...@sigmoidanalytics.com
Just make sure there is no firewall/network blocking the requests, as it's
complaining about a timeout.
Thanks
Best Regards
On Mon, Jul 20, 2015 at 1:14 AM, ankit tyagi ankittyagi.mn...@gmail.com
wrote:
Just to add more information. I have checked the status of this file, not
a single block is
Jorn meant something like this:
val filteredStream = twitterStream.transform(rdd => {
  val newRDD = scc.sc.textFile("/this/file/will/be/updated/frequently").map(x
    => (x, 1))
  rdd.join(newRDD)
})
newRDD will work like a filter when you do the join.
Thanks
Best Regards
On Sun, Jul 19, 2015 at 9:32
Could you name the storage service that you are using? Most of them
provide an S3-like REST API endpoint for you to hit.
Thanks
Best Regards
On Fri, Jul 17, 2015 at 2:06 PM, Schmirr Wurst schmirrwu...@gmail.com
wrote:
Hi,
I wonder how to use S3-compatible storage in Spark?
If I'm using
. (no matrices loaded), Same exception is
coming.
Can anyone tell what createDataFrame does internally? Are there any
alternatives for it?
On Fri, Jul 17, 2015 at 6:43 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
I suspect it's the numpy filling up memory.
Thanks
Best Regards
On Fri
Did you try inputs.repartition(1).foreachRDD(..)?
Thanks
Best Regards
On Fri, Jul 17, 2015 at 9:51 PM, PAULI, KEVIN CHRISTIAN
[AG-Contractor/1000] kevin.christian.pa...@monsanto.com wrote:
Spark newbie here, using Spark 1.3.1.
I’m consuming a stream and trying to pipe the data from the
Can you paste the code? How much memory does your system have and how big
is your dataset? Did you try df.persist(StorageLevel.MEMORY_AND_DISK)?
Thanks
Best Regards
On Fri, Jul 17, 2015 at 5:14 PM, Harit Vishwakarma
harit.vishwaka...@gmail.com wrote:
Thanks,
Code is running on a single
= sqlCtx.createDataFrame(rdd2)
4. df.save() # in parquet format
It throws an exception in the createDataFrame() call. I don't know what exactly
it is creating. Everything in memory? Or can I make it persist
simultaneously while it is getting created?
Thanks
On Fri, Jul 17, 2015 at 5:16 PM, Akhil Das ak
Which version of spark are you using? insertIntoJDBC is deprecated (from
1.4.0), you may use write.jdbc() instead.
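With 1.4+, the replacement call looks roughly like this (the URL, table name and credentials are placeholders, and df is the DataFrame holding the new rows):

import java.util.Properties

val props = new Properties()
props.setProperty("user", "dbuser")
props.setProperty("password", "secret")

// Append the DataFrame's rows to the existing MySQL table
df.write.mode("append").jdbc("jdbc:mysql://localhost:3306/mydb", "mytable", props)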
Thanks
Best Regards
On Wed, Jul 15, 2015 at 2:43 PM, Manohar753 manohar.re...@happiestminds.com
wrote:
Hi All,
I am trying to add a few new rows to an existing table in MySQL using
likely
would it be that a change like that goes through? Would it be rejected as an
uncommon scenario? I really don't want to have this as a separate form of
the branch.
Thanks,
Joel
--
*From:* Akhil Das ak...@sigmoidanalytics.com
*Sent:* Wednesday, July 15, 2015 2:07
Did you try this?
val out = lines.filter(xx => {
  val y = xx
  val x = broadcastVar.value
  var flag: Boolean = false
  for (a <- x) {
    if (y.contains(a))
      flag = true
  }
  flag
})
Thanks
Best Regards
On Wed, Jul 15, 2015 at 8:10 PM, Naveen Dabas naveen.u...@ymail.com wrote:
I
Yes you can do that, just make sure you rsync the same file to the same
location on every machine.
Thanks
Best Regards
On Thu, Jul 16, 2015 at 5:50 AM, Julien Beaudan jbeau...@stottlerhenke.com
wrote:
Hi all,
Is it possible to use Spark to assign each machine in a cluster the same
task, but
I think any requests going to s3*:// require the credentials. If they have
made it public (via http) then you won't require the keys.
Thanks
Best Regards
On Wed, Jul 15, 2015 at 2:26 AM, Pagliari, Roberto rpagli...@appcomsci.com
wrote:
Hi Sujit,
I just wanted to access public datasets on