On Thu, 12 Mar 2015 00:48:12 -0700
d34th4ck3r gautam1...@gmail.com wrote:
I'm trying to use Neo4j with Apache Spark Streaming but I am finding
serializability to be an issue.
Basically, I want Apache Spark to parse and bundle my data in real
time. After the data has been bundled, it should be
What is the GraphDatabaseService object that you are using? Instead of creating
them on the driver (outside foreachRDD), can you create them inside
RDD.foreach?
In general, the right pattern for doing this is in the programming guide
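For reference, a minimal sketch of that pattern (the store path and write
logic here are placeholder assumptions, not your actual code):

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // created on the executor, so nothing non-serializable is shipped
    val graphDb = new GraphDatabaseFactory()
      .newEmbeddedDatabase("/path/to/graph.db") // placeholder path
    try {
      records.foreach { r => /* write r to Neo4j here */ }
    } finally {
      graphDb.shutdown()
    }
  }
}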
I tried that too. It results in the same serializability issue.
The GraphDatabaseService that I'm using is GraphDatabaseFactory():
http://neo4j.com/api_docs/2.0.0/org/neo4j/graphdb/factory/GraphDatabaseFactory.html
On Thu, Mar 12, 2015 at 5:21 PM, Tathagata Das t...@databricks.com wrote:
What is
I got an answer from a mail posted to the ML.
--- Summary ---
cache() is lazy, so you can call `RDD.count()` explicitly to load the RDD
into memory.
---
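As a minimal sketch of that (sc is the SparkContext; the path is a placeholder):

val rdd = sc.textFile("hdfs:///path/to/data") // placeholder path
rdd.cache()  // lazy: only marks the RDD for caching
rdd.count()  // an action forces evaluation and populates the cache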
And when I tried it, the two RDDs were cached and it became faster.
Thank you.
Hi experts!
Is there any way to connect to SAP HANA from a Spark application and get data
from HANA tables?
Thanks
I am trying out the streaming example as documented, using Spark 1.2.1
streaming from Maven for Java.
When I add this code I get a compilation error, and Eclipse is not able to
recognize Tuple2. I also don't see any import for the scala.Tuple2 class.
I'm trying to use Neo4j with Apache Spark Streaming but I am finding
serializability to be an issue.
Basically, I want Apache Spark to parse and bundle my data in real time.
After the data has been bundled, it should be stored in the database, Neo4j.
However, I am getting this error:
There are a couple of differences between the ml-matrix implementation and
the one used in AMPCamp:
- I think the AMPCamp one uses JBLAS, which tends to ship native BLAS
libraries along with it. In ml-matrix we switched to using Breeze + netlib
BLAS, which is faster but needs some setup [1] to pick
SAP HANA can be integrated with Hadoop
(http://saphanatutorial.com/sap-hana-and-hadoop/), so you will be able to
read/write to it using the newAPIHadoopFile API of Spark by passing the
correct Configuration etc.
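For illustration, a minimal sketch of the newAPIHadoopFile call (a generic
TextInputFormat over a placeholder export path, not a HANA-specific connector):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration() // set the HANA/Hadoop connection properties here
val rows = sc.newAPIHadoopFile(
  "hdfs:///hana/export",        // placeholder path
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf)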
Thanks
Best Regards
On Thu, Mar 12, 2015 at 1:15 PM, Hafiz Mujadid
A couple of points:
You've got mismatched versions here (1.2.0 vs 1.2.1). You should fix
that, but it's not your problem.
These are also supposed to be 'provided' scope dependencies in Maven.
You should get the Scala deps transitively and can import scala.*
classes. However, it would be a little
I am not sure if you realized, but the code snippet is pretty mangled in
the email we received. It might be a good idea to put the code in a pastebin
or gist; that is much, much easier for everyone to read.
On Thu, Mar 12, 2015 at 12:48 AM, d34th4ck3r gautam1...@gmail.com wrote:
I'm trying to use Neo4j
Alright, I have also asked this question on StackOverflow:
http://stackoverflow.com/questions/28896898/using-neo4j-with-apache-spark
The code there is pretty neat.
On Thu, Mar 12, 2015 at 4:55 PM, Tathagata Das t...@databricks.com wrote:
I am not sure if you realized, but the code snippet is
Well, the answers you got there are correct as well.
Unfortunately I am not familiar enough with Neo4j to comment any further. Is
the Neo4j graph database running externally (outside the Spark cluster),
within the driver process, or on all the executors? Can you clarify that?
TD
On Thu, Mar 12, 2015
Ted Dunning and Ellen Friedman's Time Series Databases has a section on
this with some approaches to geo-encoding:
https://www.mapr.com/time-series-databases-new-ways-store-and-access-data
http://info.mapr.com/rs/mapr/images/Time_Series_Databases.pdf
On Tue, Mar 10, 2015 at 3:53 PM, John Meehan
Neo4j is running externally. It has nothing to do with the Spark processes.
Basically, the problem is that I'm unable to figure out a way to store the
output of Spark in the database, as Spark Streaming requires the Neo4j Core
Java API objects to be serializable as well.
The answer points to using the REST API, but
This error, java.lang.NoSuchMethodError, almost always indicates a version
conflict somewhere.
It looks like you are using Spark 1.1.1 with the cassandra-spark connector
1.2.0. Try aligning those; those metrics were introduced recently in the
1.2.0 branch of the Cassandra connector.
Either upgrade your
It works after syncing, thanks for the pointers.
On Tue, Mar 10, 2015 at 1:22 PM, Mohit Anchlia mohitanch...@gmail.com
wrote:
I navigated to the Maven dependencies and found the Scala library. I also
found Tuple2.class, but when I click on it in Eclipse I get 'invalid LOC
header (bad signature)'.
Thanks for the reference. Is the following procedure correct?
1. Copy the Hadoop source code
org.apache.hadoop.mapreduce.lib.input.TextInputFormat.java as my own class,
e.g. UncTextInputFormat.java
2. Modify UncTextInputFormat.java to handle UNC paths
3. Call
Thanks Akhil.
Additionally, if we want to do a SQL query we need to create a JavaPairRDD,
then a JavaRDD, then a JavaSchemaRDD, and then call sqlContext.sql(sqlQuery).
Right?
Thanks,
Udbhav Agarwal
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 12 March, 2015 11:43 AM
To: Udbhav Agarwal
Cc:
Thanks, the new guide did help: instantiating the SQLContext inside foreachRDD
did the trick for me, but the SQLContext singleton works as well.
Now the only problem left is that spark.driver.port is not retained after
restarting from a checkpoint, so my Actor receivers are running on a random
@Tobias,
According to my understanding, your approach is to register a series of tables
by using transformWith, right? And then you get a new DStream (i.e., a
SchemaDStream), which consists of lots of SchemaRDDs.
Please correct me if my understanding is wrong.
Thank you Best Regards,
I use s3n://BucketName/SomeFoler/OutputFolder and it works for my app.
On Wed, Mar 11, 2015 at 12:14 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Does it write anything in BUCKET/SUB_FOLDER/output?
Thanks
Best Regards
On Wed, Mar 11, 2015 at 10:15 AM, cpalm3 cpa...@gmail.com wrote:
Streaming takes only new files into consideration. Add the file after
starting the job.
On Thu, Mar 12, 2015 at 2:26 PM, CH.KMVPRASAD [via Apache Spark User List]
ml-node+s1001560n2201...@n3.nabble.com wrote:
Yes!
For testing purposes I defined a single file in the specified directory.
Thanks Todd,
But this link is also based on Scala; I was looking for some help with the
Java APIs.
Thanks,
Udbhav Agarwal
From: Todd Nist [mailto:tsind...@gmail.com]
Sent: 12 March, 2015 5:28 PM
To: Udbhav Agarwal
Cc: Akhil Das; user@spark.apache.org
Subject: Re: hbase sql query
Have you
Ah, missed that Java was a requirement. What distribution of Hadoop are
you using? Here is an example that may help, along with a few links to the
JavaHbaseContext and a basic example.
https://github.com/tmalaska/SparkOnHBase
With fileStream you are free to plug in any InputFormat; in your case, you
can easily plug in ParquetInputFormat. Here are some Parquet Hadoop examples:
https://github.com/Parquet/parquet-mr/tree/master/parquet-hadoop/src/main/java/parquet/hadoop/example
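For example, a minimal sketch using the example InputFormat from that repo
(assuming parquet-hadoop is on the classpath; the directory is a placeholder):

import parquet.example.data.Group
import parquet.hadoop.example.ExampleInputFormat

// the key type of ParquetInputFormat is Void; Group is parquet-mr's example record
val parquetStream =
  ssc.fileStream[Void, Group, ExampleInputFormat]("/datadumps/parquet/")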
Thanks
Best Regards
On Thu, Mar 12, 2015 at
Hi,
What is the query time for a join query on HBase with Spark SQL? Say the
tables in HBase have 0.5 million records each. I am expecting a query latency
in milliseconds with Spark SQL. Is this possible?
Thanks,
Udbhav Agarwal
I don't think that's the same issue I was seeing, but you can have a
look at https://issues.apache.org/jira/browse/SPARK-4545 for more
detail on my issue.
On Thu, Mar 12, 2015 at 12:51 AM, Tobias Pfeiffer t...@preferred.jp wrote:
Sean,
On Wed, Mar 11, 2015 at 7:43 PM, Tobias Pfeiffer
Just FYI: what @Marcelo said fixed the issue for me.
On Fri, Mar 6, 2015 at 7:11 AM, Sean Owen so...@cloudera.com wrote:
-Pscala-2.11 and -Dscala-2.11 will happen to do the same thing for this
profile.
Why are you running 'install package' and not just 'install'? Probably
doesn't matter.
This
Thanks Todd,
But this particular code is missing the JavaHbaseContext.java file; the code
only has JavaHbaseContext.scala.
Moreover, I specifically wanted to perform join queries on tables with millions
of records and expect millisecond latency. For doing a SQL query of that type I
guess we need
Interesting. Short term, maybe create the following file with pid 24922 ?
/tmp/spark-taoewang-org.apache.spark.deploy.worker.Worker-1.pid
Cheers
On Thu, Mar 12, 2015 at 6:51 PM, sequoiadb mailing-list-r...@sequoiadb.com
wrote:
Nope, I can see that the master file exists but not the worker:
$ ls
Thanks for the reply!
Theoretically I should be able to do as you suggest, as I follow the pool
design pattern from the documentation, but I don't seem to be able to run any
code after .stop() is called.
override def main(args: Array[String]) {
// setup
val ssc = new
Well, that's why I had also suggested using a pool of GraphDBService
objects :)
This is also covered in the programming guide link I had given.
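Roughly, a minimal sketch of such a pool (GraphDbHolder is a hypothetical
singleton; the store path is a placeholder):

object GraphDbHolder {
  // one lazily created instance per executor JVM, reused across batches
  lazy val db = new GraphDatabaseFactory().newEmbeddedDatabase("/path/to/graph.db")
}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val graphDb = GraphDbHolder.db // no per-partition server restart
    records.foreach { r => /* write r to Neo4j here */ }
  }
}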
TD
On Thu, Mar 12, 2015 at 7:38 PM, Gautam Bajaj gautam1...@gmail.com wrote:
Thanks a ton! That worked.
However, this may have a performance issue. As
Thanks a ton! That worked.
However, this may have a performance issue: for each partition, I'd need
to restart the server, which was the basic reason I was creating the graphDb
object outside this loop.
On Fri, Mar 13, 2015 at 5:34 AM, Tathagata Das t...@databricks.com wrote:
(Putting user@spark
I sent a PR to add skewed join last year:
https://github.com/apache/spark/pull/3505
However, it does not split a key across multiple partitions. Instead, if a key
has too many values to fit in memory, it will store the values on disk
temporarily and use the disk files to do the join.
What version of Spark are you using? You may be hitting a known but solved
bug where the receivers would not get the stop signal and (with stopGracefully
= true) the system would wait indefinitely for the receivers to stop. Try
setting stopGracefully to false and see if it works.
This bug should have been
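For reference, a minimal sketch of that workaround (assuming your
StreamingContext is named ssc):

// stop the streaming context without waiting on receivers,
// keeping the underlying SparkContext alive
ssc.stop(stopSparkContext = false, stopGracefully = false)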
Dale,
I basically have the same Maven dependency above, but my code will not
compile because it is unable to reference AvroSaver, though the saveAsAvro
reference compiles fine, which is weird. Even though saveAsAvro compiles
for me, it errors out when running the Spark job due to it not
being
Thanks Shixiong,
I'll try out your PR. Do you know what the status of the PR is? Are
there any plans to incorporate this change into DataFrames/SchemaRDDs in
Spark 1.3?
Soila
On Thu, Mar 12, 2015 at 7:52 PM, Shixiong Zhu zsxw...@gmail.com wrote:
I sent a PR to add skewed join last year:
Hi guys,
I have a stupid question, but I am not sure how to resolve it.
I deployed Spark 1.2.1 on a cluster of 30 nodes. Looking at master:8088 I can
see all the workers I have created so far. (I start the cluster with
sbin/start-all.sh.)
However, when running a Spark SQL query or even
Hello,
I am wondering how join works in Spark SQL. Does it co-partition the two
tables, or does it use a wide dependency?
I have two big tables to join; the query creates more than 150 GB of temporary
data, so the query stops because I have no space left on my disk.
I guess I could use a
Sorry guys for this.
It seems that I need to start the thrift server with the --master
spark://ms0220:7077 option, and now I can see applications running in my web UI.
Thanks, Robert
On Thursday, March 12, 2015 10:57 AM, Grandl Robert
rgra...@yahoo.com.INVALID wrote:
I figured out for
Hi. I am very much fascinated by the Spark framework. I am trying to use
PySpark + BeautifulSoup to parse HTML files. I am facing problems loading the
HTML files into BeautifulSoup.
Example:
filepath = "file:///path/to/html/directory"
def readhtml(inputhtml):
    soup = BeautifulSoup(inputhtml)  # to load the html
Hello Guys,
I'm running into the error below:
Exception in thread "main" java.lang.NoClassDefFoundError:
org/codehaus/jackson/annotate/JsonClass
I have created an uber jar with jackson-core-asl 1.9.13 and passed it with
the --jars configuration, but I am still getting errors. I searched on the
net and found a
Looking at dependency tree:
[INFO] +- org.spark-project.hive:hive-exec:jar:0.13.1a:compile
...
[INFO] | +- org.codehaus.jackson:jackson-core-asl:jar:1.9.2:compile
In the root pom.xml:
<dependency>
  <groupId>org.codehaus.jackson</groupId>
  <artifactId>jackson-core-asl</artifactId>
If you use DStream.saveAsHadoopFiles (or the equivalent RDD ops) with the
appropriate output format (for Avro), then each partition of the RDDs will
be written to a different file. However, there is probably going to be a
large number of small files, and you may have to run a separate compaction
phase to
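As a minimal sketch of that route (assuming a DStream of
(AvroKey[GenericRecord], NullWritable) pairs named records; the output prefix
and schema configuration are placeholders):

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyOutputFormat
import org.apache.hadoop.io.NullWritable

// writes one directory per batch, with one Avro file per partition
records.saveAsNewAPIHadoopFiles(
  "hdfs:///out/records", "avro",
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable],
  classOf[AvroKeyOutputFormat[GenericRecord]])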
sc.wholeTextFiles() is what you need.
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.wholeTextFiles
On Thu, Mar 12, 2015 at 9:26 AM, yh18190 yh18...@gmail.com wrote:
Hi. I am very much fascinated by the Spark framework. I am trying to use
PySpark + BeautifulSoup to
I'm using a pre-built Spark; I'm not trying to compile Spark.
The compile error appears when I try to register HighlyCompressedMapStatus
in my program:
kryo.register(classOf[org.apache.spark.scheduler.HighlyCompressedMapStatus])
If I don't register it, I get a runtime error saying that it needs
Hi,
How do you use KafkaUtils to specify a specific partition? I'm writing
custom Marathon jobs where a customer is given one partition in a topic in
Kafka. The job will get the partition from our database for that customer
and use that to get the messages for that customer.
I misinterpreted
What version of Hadoop are you using, and are you setting the
right Hadoop profile? Because they already set the Jackson version to
1.9.13 IIRC, so maybe that's the issue.
On Thu, Mar 12, 2015 at 5:15 PM, Ted Yu yuzhih...@gmail.com wrote:
Looking at dependency tree:
[INFO] +-
Hi TD,
I want to append my records to an Avro file which will later be used for
querying. Having a single file is not mandatory for us, but then how can we
make the executors append the Avro data to multiple files?
Thanks,
Sam
On Mar 12, 2015, at 4:09 AM, Tathagata Das
So... one solution would be to use a non-Jurassic version of Jackson. 2.6
will drop before too long, and 3.0 is in longer-term planning. The 1.x
series is long deprecated.
If you're genuinely stuck with something ancient, then you need to include
the JAR that contains the class, and 1.9.13 does
Giving a bit more detail on the error would make it a lot easier for others
to help you out. E.g., in this case, it would have helped if you had included
your actual compile error.
In any case, I'm assuming your issue is because that class is private to
Spark. You can sneak around that by using
Thanks! :)
Colin McQueen
*Software Developer*
On Thu, Mar 12, 2015 at 3:05 PM, Jeffrey Jedele jeffrey.jed...@gmail.com
wrote:
Hi Colin,
my understanding is that this is currently not possible with KafkaUtils.
You would have to write a custom receiver using Kafka's SimpleConsumer API.
I figured it out for spark-shell by passing the --master option. However, I'm
still troubleshooting launching SQL queries. My current command is like this:
./bin/beeline -u jdbc:hive2://ms0220:1 -n `whoami` -p ignored -f
tpch_query10.sql
On Thursday, March 12, 2015 10:37 AM, Grandl
KafkaUtils.createDirectStream, added in Spark 1.3, will let you specify a
particular topic and partition.
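For illustration, a minimal sketch (topic name, broker, and starting offset
are placeholders):

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val fromOffsets = Map(TopicAndPartition("customer-topic", 0) -> 0L)
val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))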
On Thu, Mar 12, 2015 at 1:07 PM, Colin McQueen
colin.mcqu...@shiftenergy.com wrote:
Thanks! :)
Colin McQueen
*Software Developer*
On Thu, Mar 12, 2015 at 3:05 PM, Jeffrey Jedele
Can you access the batcher directly? Like, is there a handle to get
access to the batchers on the executors by running a task on that executor?
If so, after the StreamingContext has been stopped (not the SparkContext),
then you can use `sc.makeRDD()` to run a dummy task like this.
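For instance (a minimal sketch; the host names are hypothetical):

// makeRDD's (value, preferredLocations) form pins one task per host
val hosts = Seq("executor-host-1", "executor-host-2")
sc.makeRDD(hosts.map(h => (h, Seq(h))))
  .foreach { h => /* access the batcher on this executor */ }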
Thank you all,
I tried this -Dcodehaus.jackson.version=1.9.13 as suggested by Ted Yu and
it resolved my issue.
Many thanks.
Cheers,
Uthay.
On 12 March 2015 at 17:22, Ted Yu yuzhih...@gmail.com wrote:
Uthay:
You can run the mvn dependency:tree command to find out the actual
jackson-core-asl
Why are you repartitioning to 1? That would obviously be slow; you are
converting a distributed operation into a single-node operation.
Also consider using RDD.top(). If you define the ordering right (based on
the count), then you will get the top K across partitions without doing a
shuffle for sortByKey. Much
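For illustration, a minimal sketch of the top() approach, with toy
(key, count) pairs:

val counts = sc.parallelize(Seq(("a", 3L), ("b", 7L), ("c", 1L)))
// top K by count across all partitions, no global sort needed
val top10 = counts.top(10)(Ordering.by[(String, Long), Long](_._2))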
Join causes a shuffle (sending data across the network). I expect it will
be better to filter before you join, so you reduce the amount of data which
is sent across the network.
Note this would be true for *any* transformation which causes a shuffle. It
would not be true if you're combining RDDs
I'm trying to use the AWS SDK (v1.9.23) to connect to DynamoDB from within
a Spark application. Spark 1.2.1 is assembled with HttpClient 4.2.6, but
the AWS SDK depends on HttpClient 4.3.4 for its communication with
DynamoDB. The end result is an error when the app tries to connect to
I have the same exact error. I'm running a PySpark job in yarn-client mode.
It works well in standalone mode, but I need to run it in yarn-client mode.
Other people reported the same problem when bundling jars and extra
dependencies. I'm pointing PySpark to use a specific python executable
bundled
BTW, I was running tests from SBT when I got the errors. One test turns a
Seq of case classes into a DataFrame.
I also tried running similar code in the console, but it failed with the
same error.
I tested both Spark 1.3.0-rc2 and 1.2.1 with Scala 2.11.6 and 2.10.4.
Any idea?
Jianshi
On Fri, Mar 13, 2015 at
Hi Colin,
my understanding is that this is currently not possible with KafkaUtils.
You would have to write a custom receiver using Kafka's SimpleConsumer API.
https://spark.apache.org/docs/1.2.0/streaming-custom-receivers.html
Hi,
I have a streaming application where I am doing a top-10 count in each window,
which seems slow. Is there an efficient way to do this?
val counts = keyAndValues.map(x =>
math.round(x._3.toDouble)).countByValueAndWindow(Seconds(4), Seconds(4))
val topCounts =
The error is in the original post.
Here's the recipe that worked for me:
kryo.register(Class.forName("org.roaringbitmap.RoaringArray$Element"))
kryo.register(classOf[Array[org.roaringbitmap.RoaringArray$Element]])
kryo.register(classOf[Array[Short]])
Same issue here. But the classloader in my exception is somehow different.
scala.ScalaReflectionException: class
org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with
java.net.URLClassLoader@53298398 of type class java.net.URLClassLoader with
classpath
Jianshi
On Sun, Mar 1, 2015 at
Seems the path is not set correctly. It's looking for C:\\bin\winutils.exe.
You would need to set the path correctly.
On Thu, Feb 26, 2015 at 7:59 PM, sergunok ser...@gmail.com wrote:
Hi!
I downloaded and extracted Spark to a local folder under Windows 7 and have
successfully played with it in
If you want to use another Kafka receiver instead of the current Spark Kafka
receiver, you can look at this:
https://github.com/mykidong/spark-kafka-simple-consumer-receiver/blob/master/src/main/java/spark/streaming/receiver/kafka/KafkaReceiverUtils.java
You can handle getting just the stream from the
I am running LogisticRegressionWithLBFGS. I got these lines on my console:
2015-03-12 17:38:03,897 ERROR breeze.optimize.StrongWolfeLineSearch |
Encountered bad values in function evaluation. Decreasing step size to 0.5
2015-03-12 17:38:03,967 ERROR breeze.optimize.StrongWolfeLineSearch |
In fact, by activating netlib with native libraries it goes faster.
Thanks
On Tue, Mar 10, 2015 at 7:03 PM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
There are a couple of differences between the ml-matrix implementation and
the one used in AMPCamp
- I think the AMPCamp one
On Thu, Mar 12, 2015 at 3:05 PM, Jaonary Rabarisoa jaon...@gmail.com
wrote:
In fact, by activating netlib with native libraries it goes faster.
Glad you got it to work! Better performance was one of the reasons we made
the switch.
Thanks
On Tue, Mar 10, 2015 at 7:03 PM, Shivaram Venkataraman
Hi,
It's been some time since my last message on the subject of using many RDDs
in a Spark job, but I have just encountered the same problem again. The
thing is that I have an RDD of time-tagged data that I want to 1) divide
into windows according to a timestamp field; 2) compute KMeans for
Hi.
Thanks for your answers, but isn't it necessary to use the parquetFile
method in org.apache.spark.sql.SQLContext to read Parquet files?
How can I combine your solution with the call to this method?
Thanks!!
Regards
On Thu, Mar 12, 2015 at 8:34 AM, Yijie Shen henry.yijies...@gmail.com
Have you considered using the spark-hbase-connector for this:
https://github.com/nerdammer/spark-hbase-connector
On Thu, Mar 12, 2015 at 5:19 AM, Udbhav Agarwal udbhav.agar...@syncoms.com
wrote:
Thanks Akhil.
Additionaly if we want to do sql query we need to create JavaPairRdd, then
Hi
I was running with spark-1.3.0-snapshot:
rdd = sc.textFile("hdfs://X.X.X.X/data")
rdd.first()
Then I got this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/pyspark/rdd.py", line 1243, in first
    rs = self.take(1)
  File "/pyspark/rdd.py", line 1195, in take
Hi All,
I am currently trying to write out a SchemaRDD to Avro. I noticed that there
is a Databricks spark-avro library and I have included it in my
dependencies, but it looks like I am not able to access the AvroSaver
object. On compilation of the job I get this:
error: not found: value
My jobs frequently run out of memory if the number of cores on an executor is
too high, because each core launches a new Parquet decompressor thread, which
allocates memory off-heap to decompress. Consequently, even with say 12
cores on an executor, depending on the memory, I can only use 2-3 to avoid
I am using repartitionAndSortWithinPartitions to partition my content and then
sort within each partition. I've also created a custom partitioner that I use
with repartitionAndSortWithinPartitions, as my keys consist of something like
'groupid|timestamp' and I
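For context, a minimal sketch of such a partitioner (assuming
'groupid|timestamp' string keys and a hypothetical pairs RDD of (String, V);
all names are illustrative):

import org.apache.spark.Partitioner

class GroupIdPartitioner(override val numPartitions: Int) extends Partitioner {
  // route by group id only; the sort within each partition still uses the full key
  def getPartition(key: Any): Int = {
    val groupId = key.toString.split('|')(0)
    val mod = groupId.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }
}

val sorted = pairs.repartitionAndSortWithinPartitions(new GroupIdPartitioner(8))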
Hi Adam,
Could you try building Spark with the -Pkinesis-asl profile?
mvn -Pkinesis-asl -DskipTests clean package
Refer to the 'Running the Example' section:
https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
In fact, I've seen the same issue and have been able to use the AWS SDK by
Hi Tristan,
Did upgrading to Kryo3 help?
Thanks,
Soila
On Sun, Mar 1, 2015 at 2:48 PM, Tristan Blakers tris...@blackfrog.org wrote:
Yeah I implemented the same solution. It seems to kick in around the 4B
mark, but looking at the log I suspect it’s probably a function of the
number of unique
Checking the script, it seems spark-daemon.sh is unable to stop the worker:
$ ./spark-daemon.sh stop org.apache.spark.deploy.worker.Worker 1
no org.apache.spark.deploy.worker.Worker to stop
$ ps -elf | grep spark
0 S taoewang 24922 1 0 80 0 - 733878 futex_ Mar12 ? 00:08:54 java
-cp
Does the machine have a cron job that periodically cleans up the /tmp dir?
Cheers
On Thu, Mar 12, 2015 at 6:18 PM, sequoiadb mailing-list-r...@sequoiadb.com
wrote:
Checking the script, it seems spark-daemon.sh is unable to stop the worker:
$ ./spark-daemon.sh stop org.apache.spark.deploy.worker.Worker
Does Spark support skewed joins similar to Pig, which distributes large
keys over multiple partitions? I tried using the RangePartitioner, but
I am still experiencing failures because some keys are too large to
fit in a single partition. I cannot use broadcast variables to
work around this because
Short answer: if you downloaded spark-avro from the repo.maven.apache.org
repo you might be using an old version (pre-November 14, 2014) -
see timestamps at
http://repo.maven.apache.org/maven2/com/databricks/spark-avro_2.10/0.1/
Lots of changes at https://github.com/databricks/spark-avro since
Nope, I can see that the master file exists but not the worker:
$ ls
bitrock_installer.log
hsperfdata_root
hsperfdata_taoewang
omatmp
sbt2435921113715137753.log
spark-taoewang-org.apache.spark.deploy.master.Master-1.pid
On Mar 13, 2015, at 9:34 AM, Ted Yu yuzhih...@gmail.com wrote:
Does the machine have cron
Sounds like the way of doing it. Could you try accessing a file from a UNC
path with native Java NIO code and make sure it is able to access it with the
URI file:10.196.119.230/folder1/abc.txt?
Thanks
Best Regards
On Wed, Mar 11, 2015 at 7:45 PM, Wang, Ningjun (LNG-NPV)
Thank you very much for your detailed response; it was very informative and
cleared up some of my misconceptions. After your explanation, I understand that
the distribution of the data and parallelism are all meant to be an abstraction
for the developer.
In your response you say “When you
Like this?
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result]).cache()
Here's a complete example
Spark 1.3.0 is not officially out yet, so I don't think SBT will download
the Hadoop dependencies for your Spark by itself. You could try manually
adding the Hadoop dependencies yourself (hadoop-core, hadoop-common,
hadoop-client).
Thanks
Best Regards
On Wed, Mar 11, 2015 at 9:07 PM, Patcharee
At the end of foreachRDD, I believe.
Thanks
Best Regards
On Thu, Mar 12, 2015 at 6:48 AM, Corey Nolet cjno...@gmail.com wrote:
Given the following scenario:
dstream.map(...).filter(...).window(...).foreachrdd()
When would the onBatchCompleted fire?
Hi
We have a custom build to read directories recursively. Currently we use it
with fileStream like:
val lines = ssc.fileStream[LongWritable, Text,
TextInputFormat]("/datadumps/", (t: Path) => true, true, true)
Making the 4th argument true to read recursively.
You could give it a try
Are the files already present in HDFS before you start your
application?
On Thu, Mar 12, 2015 at 11:11 AM, CH.KMVPRASAD [via Apache Spark User List]
ml-node+s1001560n22008...@n3.nabble.com wrote:
Hi, I successfully executed the SparkPi example in yarn mode, but I can't
read files
Handling exceptions this way means handling errors on the driver side,
which may or may not be what you want. You can also write functions
with exception handling inside, which could make more sense in some
cases (like, to ignore bad records or count them or something).
If you want to handle
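For example, a minimal sketch of the record-level handling (parseRecord is a
hypothetical parser that may throw):

val parsed = lines.flatMap { line =>
  try Some(parseRecord(line))          // keep good records
  catch { case e: Exception => None }  // drop (or count) bad ones
}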
org.apache.spark.deploy.SparkHadoopUtil has a method:
/**
 * Get [[FileStatus]] objects for all leaf children (files) under the given base path. If the
 * given path points to a file, return a single-element collection containing [[FileStatus]] of
 * that file.
 */
def