Thanks to everyone for suggestions and explanations.

I've now started experimenting with the following scenario, which seems to
work for me:

- Put the properties file on a web server so that it is centrally available
- Pass it to the Spark driver program via --conf 'propertiesFile=http://myWebServer.com/mymodule.properties'
- And then load the configuration using Apache Commons Configuration:

    import org.apache.commons.configuration.PropertiesConfiguration;

    PropertiesConfiguration config = new PropertiesConfiguration();
    config.load(System.getProperty("propertiesFile")); // accepts a URL string
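
Once loaded, reading a value is a one-liner (the key name below is just an
example, taken from my earlier message in this thread):

    String outputDir = config.getString("job.output.dir");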

Using the method described above, I no longer need to compile my properties
file statically into the über JAR. I can modify the file on the web server,
and when I submit my application via spark-submit, passing the URL of the
properties file, the driver program reads the contents of that file once,
retrieves the values of the keys, and continues.

PS: I've opted for Apache Commons Configuration because it is already among
the many dependencies in my pom.xml, and I did not want to pull in another
library, even though the Typesafe Config library seems to be a powerful and
flexible choice, too.

--
Emre



On Tue, Feb 17, 2015 at 6:12 PM, Charles Feduke <charles.fed...@gmail.com>
wrote:

> Emre,
>
> As you are keeping the properties file external to the JAR you need to
> make sure to submit the properties file as an additional --files (or
> whatever the necessary CLI switch is) so all the executors get a copy of
> the file along with the JAR.
>
> If you know you are going to just put the properties file on HDFS then why
> don't you define a custom system setting like "properties.url" and pass it
> along:
>
> (this is for Spark shell, the only CLI string I have available at the
> moment:)
>
> spark-shell --jars $JAR_NAME \
>     --conf 'properties.url=hdfs://config/stuff.properties' \
>     --conf 'spark.executor.extraJavaOptions=-Dproperties.url=hdfs://config/stuff.properties'
>
> ... then load the properties file during initialization by examining the
> properties.url system setting.
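>
> (A rough, untested sketch of one way to do that load, using the Hadoop
> FileSystem API, since a plain java.net.URL won't resolve hdfs:// URLs by
> default; IOException handling omitted:)
>
>     import java.io.InputStream;
>     import java.util.Properties;
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>
>     String url = System.getProperty("properties.url");
>     Properties props = new Properties();
>     try (InputStream in = FileSystem
>             .get(java.net.URI.create(url), new Configuration())
>             .open(new Path(url))) {
>         props.load(in);
>     }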
>
> I'd still strongly recommend Typesafe Config as it makes this a lot less
> painful, and I know for certain you can place your *.conf at a URL (using
> -Dconfig.url=), though it probably won't work with an HDFS URL.
>
>
>
> On Tue Feb 17 2015 at 9:53:08 AM Gerard Maas <gerard.m...@gmail.com>
> wrote:
>
>> +1 for Typesafe Config
>> Our practice is to include all spark properties under a 'spark' entry in
>> the config file alongside job-specific configuration:
>>
>> A config file would look like:
>> spark {
>>     master = ""
>>     cleaner.ttl = 123456
>>     ...
>> }
>> job {
>>     context {
>>         src = "foo"
>>         action = "barAction"
>>     }
>>     prop1 = "val1"
>> }
>>
>> Then, to create our Spark context, we transparently pass the 'spark'
>> section to a SparkConf instance. This idiom instantiates the context with
>> the Spark-specific configuration:
>>
>>
>> sparkConfig.setAll(configToStringSeq(config.getConfig("spark").atPath("spark")))
>>
>> And we can make use of the config object everywhere else.
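>>
>> (For Java users, a rough equivalent of the idiom above. configToStringSeq
>> is our own helper, so the flattening is inlined here; treat it as an
>> untested sketch:)
>>
>>     import java.util.Map;
>>     import com.typesafe.config.Config;
>>     import com.typesafe.config.ConfigValue;
>>
>>     // atPath("spark") re-prefixes the subtree, so keys come back
>>     // as "spark.master", "spark.cleaner.ttl", etc.
>>     Config sparkSection = config.getConfig("spark").atPath("spark");
>>     for (Map.Entry<String, ConfigValue> e : sparkSection.entrySet()) {
>>         sparkConf.set(e.getKey(), e.getValue().unwrapped().toString());
>>     }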
>>
>> We use the override model of Typesafe Config: reasonable defaults go in
>> reference.conf (within the jar), environment-specific overrides go in
>> application.conf (alongside the job jar), and hacks are passed with
>> -Dprop=value :-)
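>>
>> For example, with a default in reference.conf inside the jar:
>>
>>     spark.cleaner.ttl = 3600
>>
>> application.conf next to the job jar can override it:
>>
>>     spark.cleaner.ttl = 123456
>>
>> and -Dspark.cleaner.ttl=60 on the command line wins over both. (The
>> values here are made up; only the precedence order matters.)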
>>
>>
>> -kr, Gerard.
>>
>>
>> On Tue, Feb 17, 2015 at 1:45 PM, Emre Sevinc <emre.sev...@gmail.com>
>> wrote:
>>
>>> I've decided to try
>>>
>>>   spark-submit ... --conf "spark.driver.extraJavaOptions=-DpropertiesFile=/home/emre/data/myModule.properties"
>>>
>>> But when I try to retrieve the value of propertiesFile via
>>>
>>>    System.err.println("propertiesFile : " + System.getProperty("propertiesFile"));
>>>
>>> I get NULL:
>>>
>>>    propertiesFile : null
>>>
>>> Interestingly, when I run spark-submit with --verbose, I see that it
>>> prints:
>>>
>>>   spark.driver.extraJavaOptions -> -DpropertiesFile=/home/emre/data/belga/schemavalidator.properties
>>>
>>> I couldn't understand why I can't get the value of "propertiesFile"
>>> using the standard System.getProperty method. (I can use
>>> new SparkConf().get("spark.driver.extraJavaOptions") and manually parse
>>> it to retrieve the value, but I'd like to know why I cannot retrieve that
>>> value using System.getProperty.)
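>>>
>>> (By "manually parse it" I mean something along these lines; a sketch
>>> that assumes extraJavaOptions contains only that single -D flag:)
>>>
>>>     String opts = new SparkConf().get("spark.driver.extraJavaOptions");
>>>     // e.g. "-DpropertiesFile=/home/emre/data/myModule.properties"
>>>     String propertiesFile = opts.substring(opts.indexOf('=') + 1);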
>>>
>>> Any ideas?
>>>
>>> If I can achieve what I've described above properly, I plan to pass a
>>> properties file that resides on HDFS, so that it will be available to my
>>> driver program wherever that program runs.
>>>
>>> --
>>> Emre
>>>
>>>
>>>
>>>
>>> On Mon, Feb 16, 2015 at 4:41 PM, Charles Feduke <
>>> charles.fed...@gmail.com> wrote:
>>>
>>>> I haven't actually tried mixing non-Spark settings into the Spark
>>>> properties. Instead I package my properties into the jar and use the
>>>> Typesafe Config[1] library (v1.2.1), along with Ficus[2] (Scala-specific),
>>>> to get at my properties:
>>>>
>>>> Properties file: src/main/resources/integration.conf
>>>>
>>>> (below $ENV might be set to either "integration" or "prod"[3])
>>>>
>>>> ssh -t root@$HOST "/root/spark/bin/spark-shell --jars /root/$JAR_NAME \
>>>>     --conf 'config.resource=$ENV.conf' \
>>>>     --conf 'spark.executor.extraJavaOptions=-Dconfig.resource=$ENV.conf'"
>>>>
>>>> Since the properties file is packaged up with the JAR I don't have to
>>>> worry about sending the file separately to all of the slave nodes. Typesafe
>>>> Config is written in Java so it will work if you're not using Scala. (The
>>>> Typesafe Config also has the advantage of being extremely easy to integrate
>>>> with code that is using Java Properties today.)
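>>>>
>>>> (A minimal sketch of that load, in plain Java. With config.resource set
>>>> as above, ConfigFactory.load() picks the file up from the classpath
>>>> automatically; the key name below is just an example borrowed from
>>>> earlier in this thread:)
>>>>
>>>>     import com.typesafe.config.Config;
>>>>     import com.typesafe.config.ConfigFactory;
>>>>
>>>>     Config config = ConfigFactory.load();
>>>>     String outputDir = config.getString("job.output.dir");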
>>>>
>>>> If you instead want to send the file separately from the JAR and you
>>>> use the Typesafe Config library, you can specify "config.file" instead
>>>> of "config.resource", though I'd point you to [3] below if you want to
>>>> make your development life easier.
>>>>
>>>> 1. https://github.com/typesafehub/config
>>>> 2. https://github.com/ceedubs/ficus
>>>> 3.
>>>> http://deploymentzone.com/2015/01/27/spark-ec2-and-easy-spark-shell-deployment/
>>>>
>>>>
>>>>
>>>> On Mon Feb 16 2015 at 10:27:01 AM Emre Sevinc <emre.sev...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I'm using Spark 1.2.1 and have a module.properties file, and in it I
>>>>> have non-Spark properties, as well as Spark properties, e.g.:
>>>>>
>>>>>    job.output.dir=file:///home/emre/data/mymodule/out
>>>>>
>>>>> I'm trying to pass it to spark-submit via:
>>>>>
>>>>>    spark-submit --class com.myModule --master local[4] --deploy-mode
>>>>> client --verbose --properties-file /home/emre/data/mymodule.properties
>>>>> mymodule.jar
>>>>>
>>>>> And I thought I could read the value of my non-Spark property, namely
>>>>> job.output.dir, by using:
>>>>>
>>>>>     SparkConf sparkConf = new SparkConf();
>>>>>     final String validatedJSONoutputDir = sparkConf.get("job.output.dir");
>>>>>
>>>>> But it gives me an exception:
>>>>>
>>>>>     Exception in thread "main" java.util.NoSuchElementException:
>>>>> job.output.dir
>>>>>
>>>>> Is it not possible to mix Spark and non-Spark properties in a single
>>>>> .properties file, then pass it via --properties-file and then get the
>>>>> values of those non-Spark properties via SparkConf?
>>>>>
>>>>> Or is there another object / method to retrieve the values for those
>>>>> non-Spark properties?
>>>>>
>>>>>
>>>>> --
>>>>> Emre Sevinç
>>>>>
>>>>
>>>
>>>
>>> --
>>> Emre Sevinc
>>>
>>
>>


-- 
Emre Sevinc
